Weakly Supervised 3D Hand Pose Estimation via Biomechanical Constraints

by   Adrian Spurr, et al.
ETH Zurich

Estimating 3D hand pose from 2D images is a difficult inverse problem due to the inherent scale and depth ambiguities. Current state-of-the-art methods train fully supervised deep neural networks with 3D ground-truth data. However, acquiring 3D annotations is expensive, typically requiring calibrated multi-view setups or labor-intensive manual annotations. While annotations of 2D keypoints are much easier to obtain, how to efficiently leverage such weakly-supervised data to improve 3D hand pose prediction remains an important open question. The key difficulty stems from the fact that direct application of additional 2D supervision mostly benefits the 2D proxy objective but does little to alleviate the depth and scale ambiguities. Embracing this challenge, we propose a set of novel losses that constrain the network's predictions to biomechanically plausible 3D hand poses. We show by extensive experiments that our proposed constraints significantly reduce the depth ambiguity and allow the network to more effectively leverage additional 2D annotated images. For example, on the challenging FreiHAND dataset, using additional 2D annotation without our proposed biomechanical constraints reduces the depth error by only 15%, whereas the error is reduced by 50% when the proposed biomechanical constraints are used.




1 Introduction

Vision-based reconstruction of the 3D pose of human hands is a difficult problem that has applications in many domains. Given that RGB sensors are ubiquitous, recent work has focused on estimating the full 3D pose [20, 14, 12, 4, 26] and dense surface [9, 3, 7] of human hands from 2D imagery alone. This task is challenging due to the dexterity of the human hand, self-occlusions, varying lighting conditions and interactions with objects. Moreover, any given 2D point in the image plane can correspond to multiple 3D points in world space, all of which project onto that same 2D point. This makes 3D hand pose estimation from monocular imagery an ill-posed inverse problem in which depth and the resulting scale ambiguity pose a significant difficulty.

Most recent methods use deep neural networks for hand pose estimation and rely on a combination of fully labeled real and synthetic training data (e.g., [30, 20, 14, 9, 28, 2, 12, 4]). However, acquiring full 3D annotations for real images is very difficult, as it requires complex multi-view setups and labor-intensive manual annotations of 2D keypoints in all views [8, 31, 27]. On the other hand, synthetic data does not generalize well to realistic scenarios due to domain discrepancies. Some works attempt to alleviate this by leveraging additional 2D annotated images [12, 3]. Such weakly-supervised data is far easier to acquire for real images than full 3D annotations. These methods use the 2D annotations in a straightforward way, either in the form of a reprojection loss [3] or as supervision for the 2D component only [12]. However, we find that the improvements stemming from including the weakly-supervised data in such a manner are mainly a result of 3D poses that agree with the 2D projection. The uncertainties arising from depth ambiguities remain largely unaddressed, and the resulting 3D poses can still be implausible. Therefore, these methods still rely on large amounts of fully annotated training data to reduce these ambiguities. In contrast, our goal is to minimize the requirement of 3D annotated data as much as possible and maximize the utility of weakly-labeled real data.

To this end, we propose a set of biomechanically inspired constraints which can be integrated into the neural network training procedure to enable anatomically plausible 3D hand poses even for data with 2D supervision only. Our key insight is that the human hand is subject to a set of limitations imposed by its biomechanics. We model these limitations in a differentiable manner as a set of soft constraints. Note that this is a challenging problem. While bone length constraints have been used successfully [22, 29], capturing other biomechanical aspects is more difficult. This is due to the fact that 3D keypoints are predicted independently, despite there being a clear dependence between them. For example, the structure of the palm imposes a clear restriction on the relative positions of the joints, whereas the individual joint angle limits depend on each other. Hence, capturing these inter-dependencies when modeling anatomical limitations is crucial. We propose to encode these into a set of losses that is fully differentiable, interpretable and which can be incorporated into any deep learning architecture that predicts 3D joint configurations. More specifically, our set of soft constraints consists of three equations that define i) the range of valid bone lengths, ii) the range of valid palm structure, and iii) the range of valid joint angles of the thumb and fingers.

The main advantage of our set of constraints is that all parameters are interpretable and can either be set manually, opening up the possibility of personalization, or be obtained from a small set of data points for which 3D labels are available. As the backbone model, we leverage a variant of the 2.5D representation proposed by Iqbal et al. [12] due to its superior performance. We identify an issue in absolute depth calculation and remedy it via a novel refinement network. In summary, we contribute:

  • A novel set of differentiable soft constraints inspired by the biomechanical structure of the human hand.

  • Quantitative and qualitative evidence that demonstrates that our proposed set of constraints improves 3D prediction accuracy in weakly supervised settings, resulting in an improvement of 55% as opposed to 32% as yielded by straightforward use of weakly-supervised data.

  • A neural network architecture that extends [12] with a refinement step.

  • Achieving state-of-the-art performance on Dexter+Object using only synthetic and weakly-supervised real data, indicating cross-data generalizability.

The proposed constraints require no special data, nor are they specific to a particular backbone architecture.

2 Related work

Hand pose estimation from monocular RGB images has gained a lot of traction in recent years due to a large number of possible applications. Earlier methods in this area relied on optimization-based methods to fit a deformable hand model to image observations [11, 24, 18, 25, 15] or adopted search-based techniques to retrieve nearest neighbours from large databases [1, 17]. However, most recent methods utilize deep neural networks and use annotated training data to train hand pose estimation models. These models either directly regress the 3D positions of the hand keypoints [30, 20, 26, 14, 12, 23] or predict the parameters of a deformable model of the hand from which the 3D positions can be obtained using forward kinematics [2, 3, 28, 9, 25].

Zimmermann et al. [30] are the first to use a deep neural network for 3D hand pose estimation from RGB images. They introduce a multi-stage approach that regresses root-relative 3D positions from an RGB image as input. Subsequently, Spurr et al. [20] learn a unified latent space that projects multiple modalities into the same space, learning a lower level embedding of the hands. Similarly, Yang et al. [26] learn a latent space that disentangles background, camera and hand pose. However, all these methods require large amounts of fully labeled training data. Cai et al. [4] try to alleviate this problem by introducing an approach that utilizes paired RGB-D images to regularize the depth predictions. Mueller et al. [14] try to improve the quality of synthetic training data by learning a GAN model that minimizes the discrepancies between real and synthetic images. Iqbal et al. [12] decompose the task into learning 2D and relative depth components. This decomposition makes it possible to use weakly-labeled real images with only 2D pose annotations, which are very cheap to acquire. While they demonstrate better generalization by adding a large number of weakly-labeled training samples, the main drawback of this approach is that the depth ambiguities remain unaddressed. In particular, training using only 2D pose annotations does not impact the depth predictions. This may result in 3D poses with accurate 2D projections, but due to scale and depth ambiguities the 3D poses can still be implausible. In contrast, in this work, we propose a set of biomechanical constraints that encourage the predicted 3D poses to be anatomically plausible during training (see Fig. 1). We formulate these constraints in the form of fully differentiable loss functions which can be incorporated into any deep learning architecture that predicts 3D joint configurations. In our experiments we demonstrate that our proposed set of constraints can be effectively used to train models in weakly supervised settings. We use a variant of Iqbal et al. [12] as baseline and demonstrate that the requirement of fully labeled real images can be significantly reduced while still maintaining performance on par with fully supervised methods.

More recently, many methods predict the parameters of a deformable hand model, e.g., MANO [18], from RGB images [3, 28, 9, 25]. The predicted parameters consist of the shape and pose deformations w.r.t. a mean shape and pose that are learned using large amounts of 3D scans of the hand. Alternatively, Ge et al. [7] circumvent the need for a parametric hand model by directly predicting the mesh vertices from RGB images. Since these methods require both shape and pose annotations for training, obtaining such training data is even harder; therefore, most methods rely on synthetic training data. The methods [3, 28, 2] try to alleviate this by introducing re-projection losses that measure the discrepancy between the projection of the 3D mesh and image observations in the form of 2D poses [3] or silhouettes [28, 2]. Even though they utilize very strong hand priors in the form of a mean hand shape and by operating on a low-dimensional PCA space, using re-projection losses with weakly-labeled data still does not guarantee that the resulting 3D poses will be anatomically plausible. Therefore, all these methods rely on a large amount of fully labeled training data. In the body pose estimation literature, such methods generally resort to adversarial losses to ensure plausibility [13].

Similar to our work, a few other methods also aim to refine implausible 3D poses by imposing biomechanical limits on the structure of the hands [6, 5, 19]. These methods define a hand using a kinematic chain, primarily for inverse kinematics. However, the set of parameters is restricted via hard constraints, so whether they can be used for neural network training remains unanswered. In contrast, we propose soft constraints that are fully differentiable, interpretable, and can be used effectively for neural network training, as we will show in the rest of this paper.

3 Method

Figure 2: Approach overview. Given a model that takes an RGB image and predicts the 3D joints, we apply our proposed biomechanical constraints in form of losses to the predicted 3D pose. These guide the network to predict anatomically correct poses.

Our method is summarized in Figure 2. Our key contribution is a set of novel constraints that constitute a biomechanical model of the human hand and capture the bone lengths, joint angles and 3D structure of the palm. We use these biomechanical constraints to provide an inductive bias to the neural network. Specifically, we guide the network to predict anatomically plausible hand poses for weakly-supervised data (when partial supervision is provided, e.g., 2D only), which in turn increases generalizability. The model can be combined with any backbone architecture that predicts 3D keypoints. We first introduce the notation used in this paper, followed by the details of the proposed biomechanical losses. Finally, we discuss the integration with a variant of the architecture of Iqbal et al. [12].


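We use bold capital font for matrices, bold lowercase for vectors and roman font for scalars. We assume a right hand. The joints $[\mathbf{j}^{3D}_1, \ldots, \mathbf{j}^{3D}_{21}] = \mathbf{J}^{3D} \in \mathbb{R}^{21 \times 3}$ define a kinematic chain of the hand starting from the root joint $\mathbf{j}^{3D}_1$ and ending in the fingertips. For the sake of simplicity, the joints of the hand are grouped by finger, denoted as the respective sets $F_1, \ldots, F_5$, visualized in Fig. 3a. Each $\mathbf{j}^{3D}_i$, except the root joint (CMC), has a parent, denoted as $p(i)$. We define a bone $\mathbf{b}_i = \mathbf{j}^{3D}_{i+1} - \mathbf{j}^{3D}_{p(i+1)}$ as the vector pointing from the parent joint to its child joint. Hence $[\mathbf{b}_1, \ldots, \mathbf{b}_{20}] = \mathbf{B} \in \mathbb{R}^{20 \times 3}$. The bones are named according to the child joint. For example, the bone connecting MCP to PIP is called the PIP bone. We define the five root bones as the MCP bones, where one endpoint is the root $\mathbf{j}^{3D}_1$. Intuitively, the root bones are those that lie within and define the palm. We define the bones $\mathbf{b}_i$ with $i = 1, \ldots, 5$ to correspond to the root bones of fingers $F_1, \ldots, F_5$. We denote the angle between the vectors $\mathbf{v}_1, \mathbf{v}_2$ as $\alpha(\mathbf{v}_1, \mathbf{v}_2) = \arccos\left(\frac{\mathbf{v}_1^T \mathbf{v}_2}{\|\mathbf{v}_1\|_2 \|\mathbf{v}_2\|_2}\right)$. The interval loss is defined as $\mathcal{I}(x; a, b) = \max(a - x, 0) + \max(x - b, 0)$. The normalized vector is defined as $\mathrm{norm}(\mathbf{x}) = \frac{\mathbf{x}}{\|\mathbf{x}\|_2}$. Lastly, $P_{xy}(\mathbf{v})$ is the orthogonal projection operator, projecting $\mathbf{v}$ orthogonally onto the $\mathbf{x}$-$\mathbf{y}$ plane, where $\mathbf{x}$, $\mathbf{y}$ are vectors.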
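The notation above maps onto a handful of small helpers. The following NumPy sketch is illustrative only: the joint ordering (root first, then four consecutive joints per finger) and all function names are assumptions made for this example, not something the paper specifies. With this ordering the root bones sit at bone indices 0, 4, 8, 12 and 16 rather than at the first five indices.

```python
import numpy as np

def finger_parents(num_fingers=5, joints_per_finger=4):
    """Parent index per joint for a hypothetical ordering: joint 0 is the root,
    followed by (MCP, PIP, DIP, TIP) for each finger in turn."""
    parent = [-1]  # the root has no parent
    for j in range(1, num_fingers * joints_per_finger + 1):
        first_in_finger = (j - 1) % joints_per_finger == 0
        parent.append(0 if first_in_finger else j - 1)
    return np.array(parent)

def bones(joints, parent):
    """Bone vectors (child minus parent) for joints of shape (21, 3) -> (20, 3)."""
    child = np.arange(1, len(joints))
    return joints[child] - joints[parent[child]]

def normalize(v, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    v = np.asarray(v, dtype=float)
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def angle(v1, v2):
    """Angle in radians between two vectors."""
    cos = np.sum(normalize(v1) * normalize(v2), axis=-1)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def interval_loss(x, lo, hi):
    """Zero inside [lo, hi], linear in the distance to the interval outside it."""
    return np.maximum(lo - x, 0.0) + np.maximum(x - hi, 0.0)

def project_onto_plane(v, x, y):
    """Orthogonal projection of v onto the plane spanned by x and y."""
    n = normalize(np.cross(x, y))
    return v - np.sum(v * n, axis=-1, keepdims=True) * n
```

Clipping the cosine before `arccos` guards against values slightly outside [-1, 1] caused by rounding.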

a) Joint skeleton structure b) Root bone structure c) Angles. Flexion: Left – Abduction: Right
Figure 3: Illustration of our proposed biomechanical structure.

3.1 Biomechanical constraints

Our goal is to integrate our biomechanical soft constraints (BMC) into the training procedure to encourage the network to predict feasible hand poses. We seek to avoid iterative optimization approaches such as inverse kinematics in order to avert significant increases in training time.

The proposed model consists of three functional parts, visualized in Fig. 3. First, we consider the length of the bones, including the root bones of the palm. Second, we model the structure and shape of the palmar region, consisting of a rigid structure made up of individual joints. To account for inter-subject variability of bones and palm structure, it is important to not enforce a specific mean shape. Instead, we allow for these properties to lie within a valid range. Lastly, the model describes the articulation of the individual fingers. The finger motion is described via modeling of the flexion and abduction of individual bones. As their limits are interdependent, they need to be modeled jointly. As such, we propose a novel constraint that takes this interdependence into account.

The limits for each constraint can be obtained manually from measurements, from the literature (e.g., [5, 19]), or acquired in a data-driven way from 3D annotations, should they be available.

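Bone length. For each bone $\mathbf{b}_i$, we define an interval $[b^{\min}_i, b^{\max}_i]$ of valid bone length and penalize if the length $\|\mathbf{b}_i\|_2$ lies outside of this interval:

$$\mathcal{L}_{BL}(\mathbf{J}^{3D}) = \frac{1}{20} \sum_{i=1}^{20} \mathcal{I}\left(\|\mathbf{b}_i\|_2;\, b^{\min}_i, b^{\max}_i\right)$$

Here $\mathcal{I}$ is the interval loss, which is zero inside the interval and grows linearly outside it. This loss encourages keypoint predictions that yield valid bone lengths. Fig. 3a shows the length of a bone in blue.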
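A minimal NumPy sketch of such a bone length penalty is given below. The function name and the choice to average over bones are assumptions made for illustration; the per-bone limits would come from measurements, the literature, or a small 3D-annotated set, as described above.

```python
import numpy as np

def bone_length_loss(bones, min_len, max_len):
    """Mean penalty for bone lengths outside their valid interval.

    bones:   (N, 3) bone vectors
    min_len: (N,) lower limits, max_len: (N,) upper limits
    """
    lengths = np.linalg.norm(bones, axis=-1)
    # Zero inside [min_len, max_len], linear in the violation outside it.
    penalty = np.maximum(min_len - lengths, 0.0) + np.maximum(lengths - max_len, 0.0)
    return float(penalty.mean())

# Two bones of length 3 and 4; the second exceeds its upper limit by 0.5.
example = bone_length_loss(
    np.array([[0.0, 0.0, 3.0], [0.0, 4.0, 0.0]]),
    min_len=np.array([2.0, 2.0]),
    max_len=np.array([5.0, 3.5]),
)
```

Because the penalty is zero anywhere inside the interval, hands of different sizes are not pulled toward a single mean shape.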

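Root bones. To attain valid palmar structures, we first interpret the root bones as spanning a mesh and compute its curvature by following [16]:

$$c_i = \frac{(\mathbf{e}_{i+1} - \mathbf{e}_i)^T (\mathbf{b}_{i+1} - \mathbf{b}_i)}{\|\mathbf{b}_{i+1} - \mathbf{b}_i\|^2}, \quad i \in \{1, 2, 3, 4\} \tag{1}$$

where $\mathbf{e}_i$ is the edge normal at bone $\mathbf{b}_i$:

$$\mathbf{n}_i = \mathrm{norm}(\mathbf{b}_{i+1} \times \mathbf{b}_i), \quad i \in \{1, 2, 3, 4\}, \qquad \mathbf{e}_i = \begin{cases} \mathbf{n}_1 & \text{if } i = 1 \\ \mathrm{norm}(\mathbf{n}_i + \mathbf{n}_{i-1}) & \text{if } i \in \{2, 3, 4\} \\ \mathbf{n}_4 & \text{if } i = 5 \end{cases} \tag{2}$$

Positive values of $c_i$ denote an arched hand, for example when pinky and thumb touch. A flat hand has no curvature. Fig. 3b visualizes the mesh in dashed yellow and the triangle over which the curvature is computed in dashed purple.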

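We ensure that the root bones fall within correct angular ranges by defining the angular distance between neighbouring root bones $\mathbf{b}_i$, $\mathbf{b}_{i+1}$ across the plane they span:

$$\phi_i = \alpha(\mathbf{b}_i, \mathbf{b}_{i+1}) \tag{3}$$

where $\alpha(\cdot, \cdot)$ denotes the angle between two vectors. We constrain both the curvature $c_i$ and the angular distance $\phi_i$ to lie within valid ranges $[c^{\min}_i, c^{\max}_i]$ and $[\phi^{\min}_i, \phi^{\max}_i]$:

$$\mathcal{L}_{RB}(\mathbf{J}^{3D}) = \frac{1}{4} \sum_{i=1}^{4} \left( \mathcal{I}(c_i;\, c^{\min}_i, c^{\max}_i) + \mathcal{I}(\phi_i;\, \phi^{\min}_i, \phi^{\max}_i) \right)$$

$\mathcal{L}_{RB}$ ensures that the predicted joints of the palm define a valid structure, which is crucial since the kinematic chains of the fingers originate from this region.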
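The palm terms can be sketched in NumPy as follows. This is one plausible reading rather than the authors' implementation. It assumes the five root bones are ordered from thumb to pinky, that plane normals are the normalized cross products of neighbouring root bones, that interior edge normals average the two adjacent plane normals, and that curvature is the difference of edge normals projected onto the vector between neighbouring bones. All function names are hypothetical.

```python
import numpy as np

def _unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def palm_features(root_bones):
    """Curvature and angular distance between neighbouring root bones.

    root_bones: (5, 3), assumed ordered thumb to pinky.
    Returns (curvature, angular_distance), each of shape (4,).
    """
    b = np.asarray(root_bones, dtype=float)
    n = _unit(np.cross(b[1:], b[:-1]))       # (4, 3) normals of neighbouring pairs
    e = np.empty_like(b)                     # (5, 3) edge normals
    e[0] = n[0]
    e[1:4] = _unit(n[1:] + n[:-1])           # average of the two adjacent normals
    e[4] = n[3]
    d = b[1:] - b[:-1]
    curvature = np.sum((e[1:] - e[:-1]) * d, axis=-1) / np.sum(d * d, axis=-1)
    cos = np.sum(_unit(b[:-1]) * _unit(b[1:]), axis=-1)
    angular_distance = np.arccos(np.clip(cos, -1.0, 1.0))
    return curvature, angular_distance

def root_bone_loss(root_bones, curvature_limits, angle_limits):
    """Mean interval penalty on curvature and angular distance.
    Each limits argument is a (low, high) pair of scalars or (4,) arrays."""
    c, phi = palm_features(root_bones)
    def interval(x, lo, hi):
        return np.maximum(lo - x, 0.0) + np.maximum(x - hi, 0.0)
    total = interval(c, *curvature_limits) + interval(phi, *angle_limits)
    return float(np.mean(total))
```

For a perfectly flat, fanned-out palm all plane normals coincide, so the curvature is zero and only the angular distance can be penalized.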

Joint angles. To compute the joint angles, we first need to define a consistent frame $\mathbf{F}_i$ of a local coordinate system for each finger bone $\mathbf{b}_i$. $\mathbf{F}_i$ must be consistent with respect to the movements of the finger. In other words, if one constructs $\mathbf{F}_i$ given a pose $\mathbf{J}^{3D}_1$, then moves the fingers and the corresponding $\mathbf{F}_i$ into pose $\mathbf{J}^{3D}_2$, the resulting $\mathbf{F}_i$ should be the same as if constructed from $\mathbf{J}^{3D}_2$ directly.

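We assume right-handed coordinate systems. To construct $\mathbf{F}_i$, we define two out of three axes based on the palm. We start with the first layer of finger bones (PIP bones). We define the $z$-component of their respective $\mathbf{F}_i$ as the normalized parent bone (in this case, a root bone): $\mathbf{z}_i = \mathrm{norm}(\mathbf{b}_{p(i)})$. Next, we define the $x$-axis, based on the plane normals spanned by two neighbouring root bones:

$$\mathbf{x}_i = \begin{cases} -\mathbf{n}_{p(i)} & \text{if } p(i) \in \{1, 2\} \\ -\mathrm{norm}(\mathbf{n}_{p(i)} + \mathbf{n}_{p(i)-1}) & \text{if } p(i) \in \{3, 4\} \\ -\mathbf{n}_{p(i)-1} & \text{if } p(i) = 5 \end{cases} \tag{4}$$

where $\mathbf{n}_i$ is defined as in Eq. 2. Lastly, we compute the last axis $\mathbf{y}_i = \mathrm{norm}(\mathbf{z}_i \times \mathbf{x}_i)$. Given $\mathbf{F}_i$, we can now define the flexion and abduction angles. Each of these angles is given with respect to the local $z$-axis of $\mathbf{F}_i$. Given $\mathbf{b}_i$ in its local coordinates $\mathbf{b}^{\mathbf{F}_i}_i$ w.r.t. $\mathbf{F}_i$, we define the flexion and abduction angles as:

$$\theta^f_i = \alpha\left(P_{xz}(\mathbf{b}^{\mathbf{F}_i}_i), \mathbf{z}_i\right), \qquad \theta^a_i = \alpha\left(P_{xz}(\mathbf{b}^{\mathbf{F}_i}_i), \mathbf{b}^{\mathbf{F}_i}_i\right) \tag{5}$$

Fig. 3c visualizes $\mathbf{F}_i$ and the resulting angles. Note that this formulation leads to ambiguities, where different bone orientations can map to the same $(\theta^f_i, \theta^a_i)$-point. We resolve this via an octant lookup, which leads to angles in the intervals $\theta^f_i \in [-\pi, \pi]$ and $\theta^a_i \in [-\pi/2, \pi/2]$ respectively. See appendix for more details.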
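To make the frame and angle construction concrete, here is a small NumPy sketch under explicit assumptions: the local z-axis follows the parent bone, the x-axis is derived from a palm normal, flexion is measured inside the local x-z plane and abduction is the elevation out of that plane. It uses `arctan2` to obtain signed angles directly, which is a substitute for the octant lookup mentioned above; whether the two agree in every case has not been checked against the appendix. The Gram-Schmidt step and all names are additions for this example.

```python
import numpy as np

def local_frame(parent_bone, palm_normal):
    """Right-handed frame whose rows are the local x, y and z axes."""
    z = parent_bone / np.linalg.norm(parent_bone)
    # Gram-Schmidt: remove the component of the normal along z (an added safeguard).
    x = palm_normal - np.dot(palm_normal, z) * z
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    return np.stack([x, y, z])

def flexion_abduction(bone, frame):
    """Signed flexion and abduction angles of a bone in the given frame."""
    bx, by, bz = frame @ bone                  # bone in local coordinates
    flexion = np.arctan2(bx, bz)               # angle within the x-z plane, in [-pi, pi]
    abduction = np.arctan2(by, np.hypot(bx, bz))  # elevation out of it, in [-pi/2, pi/2]
    return flexion, abduction

# A bone tilted 45 degrees from the parent direction inside the x-z plane.
frame = local_frame(np.array([0.0, 0.0, 2.0]), np.array([1.0, 0.0, 0.0]))
flex, abd = flexion_abduction(np.array([1.0, 0.0, 1.0]), frame)
```

In this example the frame is the identity, so the bone (1, 0, 1) has a flexion of π/4 and no abduction.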

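Given the angles of the first set of finger bones, we can then construct the remaining two rows of finger bones. Let $\mathbf{R}^{\theta_i}$ denote the rotation matrix that rotates by $\theta^f_i$ and $\theta^a_i$ such that $\mathbf{R}^{\theta_i} \mathbf{z}_i = \mathrm{norm}(\mathbf{b}_i)$. We then iteratively construct the remaining frames along the kinematic chain of the fingers:

$$\mathbf{F}_i = \mathbf{R}^{\theta_{p(i)}} \mathbf{F}_{p(i)} \tag{6}$$

This method of frame construction via rotating by $\theta^f_i$ and $\theta^a_i$ ensures consistency across poses. The remaining angles can be acquired as described in Eq. 5.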

Lastly, the angles need to be constrained. One way to do this is to consider each angle independently and penalize it if it lies outside an interval. This corresponds to constraining the angles within a box in 2D space, where the corners are the min/max of the limits. However, the limits of the finger angles are interdependent, so we propose an alternative approach that accounts for this. Given points $\theta_i = (\theta^f_i, \theta^a_i)$ that define a range of motion, we approximate their convex hull on the $(\theta^f, \theta^a)$-plane with a fixed set of points $\mathcal{H}_i$. The angles are constrained to lie within this structure by minimizing their distance to it:

$$\mathcal{L}_A(\mathbf{J}^{3D}) = \frac{1}{15} \sum_{i=1}^{15} D_H(\theta_i, \mathcal{H}_i) \tag{7}$$

where $D_H$ is the distance of point $\theta_i$ to the hull $\mathcal{H}_i$. Details on the convex hull approximation and implementation can be found in the appendix.
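The distance of an angle pair to a convex hull can be sketched as below. The appendix with the authors' hull approximation is not part of this text, so this is a generic point-to-convex-polygon distance. It assumes the hull is given as vertices in counter-clockwise order, and the function name is hypothetical.

```python
import numpy as np

def hull_distance(point, hull):
    """Distance from a 2D point to a convex polygon (zero if inside).

    point: (2,) e.g. a (flexion, abduction) pair
    hull:  (K, 2) polygon vertices in counter-clockwise order
    """
    a = np.asarray(hull, dtype=float)
    b = np.roll(a, -1, axis=0)                 # end point of each edge
    edge = b - a
    rel = np.asarray(point, dtype=float) - a
    # For a CCW polygon, the point is inside if it lies left of every edge.
    cross = edge[:, 0] * rel[:, 1] - edge[:, 1] * rel[:, 0]
    if np.all(cross >= 0.0):
        return 0.0
    # Otherwise take the closest point on any edge segment.
    t = np.clip(np.sum(rel * edge, axis=1) / np.sum(edge * edge, axis=1), 0.0, 1.0)
    closest = a + t[:, None] * edge
    return float(np.min(np.linalg.norm(point - closest, axis=1)))

square = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
inside = hull_distance(np.array([0.5, 0.5]), square)
outside = hull_distance(np.array([2.0, 0.5]), square)
```

Compared with two independent interval penalties (a box), a hull can exclude combinations of flexion and abduction that are individually valid but jointly unreachable.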

3.2 Refinement

The 2.5D joint representation allows us to recover the value of the absolute pose $Z^{root}$ up to a scaling factor. This is done by solving a quadratic equation dependent on the 2D projection $\mathbf{J}^{2D}$ and relative depth values $\mathbf{z}^r$, as proposed in [12]. In practice, small errors in $\mathbf{J}^{2D}$ or $\mathbf{z}^r$ can result in large deviations of $Z^{root}$. This leads to big fluctuations in the translation and scale of the predicted pose, which is undesirable. To alleviate these issues, we employ an MLP to refine and smooth the calculated $\hat{Z}^{root}$:

$$\hat{Z}^{root}_{ref} = \hat{Z}^{root} + M_{MLP}(\mathbf{z}^r, \mathbf{J}^{2D}, \hat{Z}^{root}; \omega) \tag{8}$$

where $M_{MLP}$ is a multilayer perceptron with parameters $\omega$ that takes the predicted and calculated values $\mathbf{z}^r$, $\mathbf{J}^{2D}$, $\hat{Z}^{root}$ and outputs a residual term. Alternatively, one could predict $Z^{root}$ directly using an MLP with the same input. However, as the exact relationship between the predicted variables and $Z^{root}$ is known, we resort to the refinement approach instead of requiring a model to learn what is already known.
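The residual refinement idea can be illustrated with a tiny NumPy forward pass. The layer sizes, the input layout (21 relative depths, 21 2D keypoints and the calculated root depth concatenated into one vector) and the zero-initialized output layer are assumptions chosen for this sketch; the text does not describe the MLP's architecture.

```python
import numpy as np

def init_refiner(in_dim, hidden=64, seed=0):
    """Parameters of a hypothetical one-hidden-layer MLP."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        # Zero output layer: the refiner starts out as the identity mapping.
        "W2": np.zeros((hidden, 1)),
        "b2": np.zeros(1),
    }

def refine_root_depth(z_root, rel_depth, kp2d, params):
    """Return the calculated root depth plus a learned residual."""
    feats = np.concatenate([rel_depth.ravel(), kp2d.ravel(), [z_root]])
    hidden = np.maximum(feats @ params["W1"] + params["b1"], 0.0)  # ReLU
    residual = (hidden @ params["W2"] + params["b2"])[0]
    return z_root + residual

params = init_refiner(in_dim=21 + 42 + 1)
refined = refine_root_depth(0.6, np.zeros(21), np.zeros((21, 2)), params)
```

Predicting a residual means the analytically computed depth is kept as the starting point and the network only has to learn corrections to it.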

3.3 Final loss

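The biomechanical soft constraint loss is constructed as follows:

$$\mathcal{L}_{BMC} = \mathcal{L}_{BL} + \mathcal{L}_{RB} + \mathcal{L}_{A} \tag{9}$$

Our final model is trained on the following loss function:

$$\mathcal{L} = \lambda_{J^{2D}} \mathcal{L}_{J^{2D}} + \lambda_{z^r} \mathcal{L}_{z^r} + \lambda_{Z^{root}} \mathcal{L}_{Z^{root}} + \lambda_{BMC} \mathcal{L}_{BMC} \tag{10}$$

where $\mathcal{L}_{J^{2D}}$, $\mathcal{L}_{z^r}$ and $\mathcal{L}_{Z^{root}}$ are the L1 loss on any available $\mathbf{J}^{2D}$, $\mathbf{z}^r$ and $Z^{root}$ labels respectively. The weights $\lambda_i$ balance the individual loss terms.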
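A sketch of how such a loss might be assembled when only some labels are available is shown below. It assumes an L1 loss for the supervised terms and uses hypothetical dictionary keys and weight values; the biomechanical term needs no labels, so it is applied to every sample.

```python
import numpy as np

def l1(pred, target):
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(target))))

def total_loss(pred, labels, bmc, weights):
    """Weighted sum of the supervised terms that have labels, plus the BMC term.

    pred, labels: dicts keyed by 'kp2d', 'rel_depth', 'root_depth';
                  a label of None marks a missing annotation.
    bmc:          value of the biomechanical loss for this sample.
    """
    loss = weights["bmc"] * bmc
    for key in ("kp2d", "rel_depth", "root_depth"):
        if labels.get(key) is not None:
            loss += weights[key] * l1(pred[key], labels[key])
    return loss

# Weakly supervised sample: only 2D keypoints are annotated.
pred = {"kp2d": np.ones((21, 2)), "rel_depth": np.zeros(21), "root_depth": 0.5}
weak_labels = {"kp2d": np.zeros((21, 2)), "rel_depth": None, "root_depth": None}
weights = {"kp2d": 2.0, "rel_depth": 1.0, "root_depth": 1.0, "bmc": 0.5}
weak = total_loss(pred, weak_labels, bmc=0.2, weights=weights)
```

For the weakly supervised sample, the depth-related predictions receive a training signal only through the biomechanical term.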

4 Implementation

For all experiments, we use a ResNet-50 backbone [10]. The input to our model is an RGB image from which the 2.5D representation is directly regressed. The model and its refinement step are trained on fully supervised and weakly-supervised data. The network was trained for 70 epochs using SGD with a step-wise learning rate decay after every 30 epochs. We apply the biomechanical constraints directly on the predicted 3D keypoints $\mathbf{J}^{3D}$.
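The step-wise schedule can be written as a one-line function. The base learning rate and decay factor below are placeholders chosen for illustration, since their values are not given in this text; only the 70 epochs and the 30-epoch step come from the description above.

```python
def step_lr(epoch, base_lr, decay, step=30):
    """Learning rate after multiplying by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

# Placeholder values for base_lr and decay (assumptions, not the paper's settings).
schedule = [step_lr(epoch, base_lr=1.0, decay=0.5) for epoch in range(70)]
```

With 70 epochs and a 30-epoch step, the rate is decayed twice, at epochs 30 and 60.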

5 Evaluation

In this section we introduce the datasets used, show the performance of our proposed BMC and compare it in a variety of settings. More specifically, we study the effect of adding weakly supervised data to complement fully supervised training. All experiments are conducted in a setting where we assume access to a fully supervised dataset, as well as a supplementary weakly supervised real dataset. Therefore we assume access to 2D ground-truth annotations and the computed constraint limits. We study two cases of 3D supervision sources:

Synthetic data. We choose RHD. Acquiring fully labeled synthetic data is substantially easier compared to real data. Section 5.3-5.5 consider this setting.

Partially labeled real data. In Section 5.6 we gradually increase the number of 3D annotated real samples to study how the proposed approach works under different ratio of fully to weakly supervised data.

To make clear what kind of supervision is used, we denote $3D_A$ if 3D annotation is used from dataset $A$. We indicate usage of 2D from dataset $A$ as $2D_A$. Sections 5.3 and 5.4 are evaluated on the FH dataset.

5.1 Datasets

| Name | Type | # joints | # train / test |
|---|---|---|---|
| Rendered Hand Pose (RHD) [30] | Synth | 21 | 42k / 2.7k |
| FreiHAND (FH) [31] | Real | 21 | 33k / 4.0k |
| Dexter+Object (D+O) [21] | Real | 5 | - / 3.1k |
| Hand-Object 3D (HO-3D) [8] | Real | 21 | 11k / 6.6k |

Table 1: Overview of datasets used for evaluation.

Each dataset that provides 3D annotation also provides the camera intrinsics. Therefore the 2D pose can be easily acquired from the 3D pose. We provide an overview of the datasets used in this work in Table 1. The test sets of HO-3D and FH are available only via an online submission system with a limited number of total submissions. Therefore, for the ablation study (Section 5.4) and for inspecting the effect of weak supervision (Section 5.3), we divide the training set into training and validation splits. For these sections, we chose to evaluate on the validation set of FH due to its large number of samples and variability in both hand pose and shape.

5.2 Evaluation Metric

HO-3D. The evaluation score given by an online submission system is computed as the mean joint error in mm. The INTERP is the performance on test frames sampled from training sequences that are not present in the training set. The EXTRAP is the performance on test samples that have neither hand shapes nor objects present in the training set.

FH. The evaluation score given by an online submission system is computed as the mean joint error in mm. Additionally, the area under the curve (AUC) of the percentage of correct keypoints (PCK) plot is reported. The PCK values lie in an interval from 0 mm to 50 mm with 100 equally spaced thresholds. Both the aligned (using Procrustes analysis) and unaligned scores are given. We report the aligned score. The unaligned score can be found in the appendix.

D+O. We report the AUC for the PCK thresholds of 20 to 50 mm to be comparable with prior work [31, 28, 3]. For [12, 20, 14, 30] we report the numbers as presented in [28], as they consolidate all AUC of related work in a consistent manner using the same PCK thresholds. For [2], we recomputed the AUC for the same interval based on the values provided by the authors.
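The AUC of the PCK curve can be computed as sketched below. This is a generic implementation under assumptions: errors are per-keypoint Euclidean distances in mm, the curve is integrated with the trapezoidal rule and normalized by the threshold range, and the function name is hypothetical. The official evaluation servers may differ in details.

```python
import numpy as np

def pck_auc(errors_mm, t_min=0.0, t_max=50.0, num=100):
    """Normalized area under the PCK curve.

    errors_mm: array of per-keypoint errors in mm
    Returns (auc, thresholds, pck).
    """
    errors_mm = np.asarray(errors_mm, dtype=float)
    thresholds = np.linspace(t_min, t_max, num)
    # PCK: fraction of keypoints whose error is within each threshold.
    pck = np.array([(errors_mm <= t).mean() for t in thresholds])
    # Trapezoidal integration, normalized so a perfect curve scores 1.
    area = np.sum((pck[1:] + pck[:-1]) / 2.0 * np.diff(thresholds))
    return area / (t_max - t_min), thresholds, pck

auc_fh, _, _ = pck_auc(np.full(10, 25.0))              # 0 to 50 mm, 100 thresholds
auc_do, _, _ = pck_auc(np.full(10, 25.0), t_min=20.0)  # 20 to 50 mm range
```

The same function covers both ranges described above by changing `t_min`.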

5.3 Effect of Weak-Supervision

| Supervision | Description | Mean 2D error (px) | Mean Z error (mm) | Mean 3D error (mm) |
|---|---|---|---|---|
| $3D_{RHD} + 3D_{FH}$ | Fully supervised, synthetic+real | 3.72 | 5.69 | 8.78 |
| + $\mathcal{L}_{BMC}$ (ours) | + BMC | 3.70 | 5.44 | 8.60 |
| $3D_{RHD}$ | Fully supervised, synthetic only | 12.35 | 20.02 | 30.82 |
| + $2D_{FH}$ | + Weakly supervised, real | 3.80 | 17.02 | 20.92 |
| + $\mathcal{L}_{BMC}$ (ours) | + BMC | 3.79 | 9.97 | 13.78 |

Table 2: The effect of weak supervision as evaluated on the validation split of FH. Training on synthetic data (RHD) leads to poor accuracy on real data (FH). Adding weakly-supervised data (FH) improves 3D prediction performance by better aligning the predictions with the 2D projection. By incorporating our proposed biomechanical constraints we significantly improve 3D pose accuracy due to more accurate depth (Z).

We first inspect closely how weak supervision affects the performance of the model using fully-supervised RHD and weakly-supervised FH. More specifically, we decompose the 3D prediction performance on the validation set of FH in terms of its 2D and depth (Z) components via the pinhole camera model and evaluate their individual accuracy.
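One way to carry out such a decomposition is sketched below in NumPy. It assumes camera-space 3D points in mm, a standard 3x3 intrinsics matrix, and absolute (not root-relative) depth; whether the reported numbers use exactly this convention is not stated here. The intrinsics and points in the example are made up.

```python
import numpy as np

def project(points, K):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixels."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def decomposed_errors(pred, gt, K):
    """Mean 2D error (px), mean depth error (mm) and mean 3D error (mm)."""
    err_2d = np.linalg.norm(project(pred, K) - project(gt, K), axis=1).mean()
    err_z = np.abs(pred[:, 2] - gt[:, 2]).mean()
    err_3d = np.linalg.norm(pred - gt, axis=1).mean()
    return float(err_2d), float(err_z), float(err_3d)

# Hypothetical intrinsics and points; the prediction is the ground truth scaled by 1.1.
K = np.array([[500.0, 0.0, 100.0], [0.0, 500.0, 100.0], [0.0, 0.0, 1.0]])
gt = np.array([[0.0, 0.0, 500.0], [10.0, 0.0, 500.0]])
err_2d, err_z, err_3d = decomposed_errors(1.1 * gt, gt, K)
```

Scaling a pose about the camera centre leaves its projection unchanged, so the 2D error is zero while the depth error is 50 mm. This is the scale and depth ambiguity that 2D supervision alone cannot resolve.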

We train four models using different data sources. 1. Full 3D supervision on both synthetic RHD and real FH ($3D_{RHD} + 3D_{FH}$), which serves as an upper bound when 3D labels are available. 2. Fully supervised on RHD only, which constitutes our lower bound on accuracy ($3D_{RHD}$). 3. Fully supervised on RHD with straightforward application of additional weakly-supervised FH ($3D_{RHD} + 2D_{FH}$). 4. The same as the previous setting, but applying our proposed constraints during the training procedure ($3D_{RHD} + 2D_{FH} + \mathcal{L}_{BMC}$).

Table 2 shows the results. The model trained with full 3D supervision from real and synthetic data reflects the best setting. Adding the BMC loss during training slightly reduces the 3D error (from 8.78 to 8.60 mm), primarily due to the regularization effect and better manifold learning during training. When the model is trained only on synthetic data ($3D_{RHD}$) we observe a significant increase in 3D error (from 8.78 to 30.82 mm) due to the poor generalization from synthetic data. When weak supervision is provided from the real data ($3D_{RHD} + 2D_{FH}$) straightforwardly, the error is reduced from 30.82 to 20.92 mm. However, inspecting this more closely we observe that the improvement comes mainly from 2D error reduction (12.35 to 3.80 px), whereas the depth component is improved only marginally (20.02 to 17.02 mm, a 15% improvement). Observing these samples qualitatively (see Fig. 1), we see that many do not adhere to the biomechanical limits of the human hand. By penalizing such violations via adding our proposed set of constraints to the weakly supervised setting, we see a significant improvement in 3D error (20.92 to 13.78 mm), which is due to improved depth accuracy (17.02 to 9.97 mm, a 50% improvement over the synthetic-only model). Inspecting the qualitative samples of this model, we see that it predicts the correct 3D pose in challenging settings such as heavy self- and object occlusion, despite having never seen such samples in 3D. Since our model prescribes a valid range rather than a specific pose, slight deviations from the ground-truth 3D pose have to be expected, which explains the small remaining gap to the fully supervised model.
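The relative improvements follow directly from the Table 2 entries, with each reduction measured against the synthetic-only model. A short check (the helper name is only for illustration):

```python
def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before

# Depth (Z) error in mm, relative to training on synthetic data only (20.02 mm).
z_weak = relative_reduction(20.02, 17.02)    # adding 2D annotations only
z_bmc = relative_reduction(20.02, 9.97)      # adding 2D annotations and BMC

# 3D error in mm, relative to training on synthetic data only (30.82 mm).
e3d_weak = relative_reduction(30.82, 20.92)
e3d_bmc = relative_reduction(30.82, 13.78)

print(round(z_weak), round(z_bmc), round(e3d_weak), round(e3d_bmc))
```

Rounded to whole percentages, the depth error falls by 15% and 50%, and the 3D error by 32% and 55%.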

5.4 Ablation Study

Here we quantify the individual contributions of each of our proposals on the validation set of the FH dataset. We also reproduce these results on the HO-3D dataset, which can be found in the supplementary material. Each error metric is computed for the root-relative 3D pose.

Figure 4: Impact of our proposed losses. (a) Input image: all 3D poses project to the same 2D pose. (b) Ground-truth pose. (c) The bone length loss results in poses that have correct bone lengths, but may have invalid angles and palm structure. (d) Including the root bone loss imposes a correct palm, but the fingers are still articulated incorrectly. (e) Adding the angle loss leads to the finger bones having correct angles. The resulting hand is plausible and close to the ground truth.

Refinement network.

| Ablation study | EPE mean (mm) | EPE median (mm) | AUC |
|---|---|---|---|
| w/o refinement | 11.20 | 8.62 | 0.95 |
| w. refinement (ours) | 9.76 | 8.14 | 0.97 |

Table 3: Effect of refinement

In Table 3 we evaluate the impact of refinement as proposed in Sec. 3.2. We train two models using full supervision on FH ($3D_{FH}$). The first model (w/o refinement) does not include the refinement step, whereas the second does (w. refinement). Using refinement, the mean error is reduced by 1.44 mm, which indicates that the refiner effectively reduces outliers.

Components of BMC.

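| Ablation study | EPE mean (mm) | EPE median (mm) | AUC |
|---|---|---|---|
| $3D_{RHD} + 2D_{FH}$ | 20.92 | 16.93 | 0.81 |
| + $\mathcal{L}_{BL}$ (ours) | 17.58 | 14.81 | 0.88 |
| + $\mathcal{L}_{RB}$ (ours) | 15.48 | 13.49 | 0.91 |
| + $\mathcal{L}_{A}$ (ours) | 13.78 | 11.61 | 0.92 |
| $3D_{RHD} + 3D_{FH}$ | 8.78 | 7.25 | 0.98 |

Table 4: Effect of BMC components.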

In Table 4, we perform a series of experiments where we incrementally add each of the proposed constraints. For 3D guidance, we use the synthetic RHD dataset and only use the 2D annotation of FH. We first run the baseline model trained only on this data ($3D_{RHD} + 2D_{FH}$). Next, we add the bone length loss $\mathcal{L}_{BL}$, followed by the root bone loss $\mathcal{L}_{RB}$ and the angle loss $\mathcal{L}_{A}$. An upper bound is given by our model trained fully supervised on both datasets ($3D_{RHD} + 3D_{FH}$). We can see that each component contributes positively towards the final performance, totalling a decrease of 7.14 mm in mean error as compared to our weakly-supervised baseline, significantly closing the gap to the fully supervised upper bound. A qualitative assessment of the individual losses can be seen in Fig. 4.

Co-dependency between angles.

| Ablation study | EPE mean (mm) | EPE median (mm) | AUC |
|---|---|---|---|
| Independent | 15.57 | 13.45 | 0.91 |
| Dependent | 13.78 | 11.61 | 0.92 |

Table 5: Effect of angle constraints

In Table 5, we demonstrate the importance of modeling the dependencies between the flexion and abduction angle limits, as opposed to regarding them independently. The valid angle range is defined to lie within an approximated convex hull in the angle plane. Using co-dependent angle limits yields a decrease in mean error of 1.79 mm.

Constraint limits.

| Ablation study | EPE mean (mm) | EPE median (mm) | AUC |
|---|---|---|---|
| Approximated | 16.14 | 13.93 | 0.90 |
| Computed | 13.78 | 11.61 | 0.92 |

Table 6: Effect of limits

In Table 6, we investigate the effect of the used limits on the final performance, as one may have to resort to approximations. To simulate this situation, we instead take the hand parameters from RHD and perform the same weakly-supervised experiment as before ($3D_{RHD} + 2D_{FH} + \mathcal{L}_{BMC}$). Approximating the limits from another dataset slightly increases the error, but still clearly outperforms the 2D baseline (16.14 mm vs 20.92 mm).

5.5 Bootstrapping with Synthetic Data

We validate our proposed constraints on the test sets of FH and HO-3D. We train the same four models as in Sec. 5.3, using RHD as the fully supervised dataset and a weakly-supervised real dataset R, where R is either FH or HO-3D.

For all results in this subsection, we train on the full dataset and evaluate on the official test split via the online submission system. Additionally, we evaluate the cross-dataset performance on the D+O dataset to demonstrate how our proposed constraints improve generalizability, and compare with prior work [12, 14, 3, 28, 2].

| Supervision | Description | R=FH: mean error | R=FH: AUC | R=HO-3D: EXTRAP (mm) | R=HO-3D: INTERP (mm) |
|---|---|---|---|---|---|
| $3D_{RHD} + 3D_{R}$ | Fully sup. upper bound | 0.90 | 0.82 | 18.22 | 5.02 |
| $3D_{RHD}$ | Fully sup. lower bound | 1.60 | 0.69 | 20.84 | 33.57 |
| + $2D_{R}$ | + Weakly sup. | 1.26 | 0.75 | 19.57 | 25.16 |
| + $\mathcal{L}_{BMC}$ (ours) | + BMC | 1.13 | 0.78 | 18.42 | 10.31 |

Table 7: Bootstrapping results on the respective test split, as evaluated by the online submission system. Training on RHD leads to poor accuracy on both FH and HO-3D. Adding more weakly-supervised data (2D annotations) in a straightforward fashion improves results. By incorporating our proposed BMC constraints our model yields a significant boost in accuracy, especially evident in the INTERP score of HO-3D.

FH. The first section of Table 7 shows the dataset performance for R = FH. As hypothesised, training solely on RHD ($3D_{RHD}$) performs the worst. Adding real data with 2D annotations ($3D_{RHD} + 2D_{FH}$) improves performance, as we reduce the domain gap between the real and synthetic data. Including the proposed BMC ($3D_{RHD} + 2D_{FH} + \mathcal{L}_{BMC}$) results in a further performance boost.

HO-3D. The second section of Table 7 shows the performance for R = HO-3D. A similar trend can be observed. Most notably, our constraints yield an improvement of 14.85 mm in the INTERP score (from 25.16 mm to 10.31 mm). This is significantly larger than the improvement the 2D data adds, which is 8.41 mm (from 33.57 mm to 25.16 mm). For the EXTRAP score, the BMC yield an improvement of 1.15 mm, which is close to the 1.27 mm gained from 2D data. This demonstrates that our proposed BMC help leverage 2D data more effectively in unseen scenarios.
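
The improvements discussed for HO-3D follow directly from the rows of Table 7. The short sketch below recomputes them, assuming the two HO-3D columns are ordered as (EXTRAP, INTERP) and given in mm. The dictionary keys and the `gain` helper are names introduced only for this example.

```python
# HO-3D columns of Table 7, assumed to be ordered (EXTRAP, INTERP), in mm.
ho3d = {
    "lower_bound": (20.84, 33.57),  # trained on RHD only
    "weakly_sup": (19.57, 25.16),   # + 2D annotations from real data
    "bmc": (18.42, 10.31),          # + biomechanical constraints
}


def gain(before, after):
    """Error reduction (EXTRAP, INTERP) when moving from one row to another."""
    return tuple(round(b - a, 2) for b, a in zip(ho3d[before], ho3d[after]))


gain_2d = gain("lower_bound", "weakly_sup")   # what the 2D data adds
gain_bmc = gain("weakly_sup", "bmc")          # what BMC adds on top
print("2D data:", gain_2d, "BMC:", gain_bmc)
```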

D+O                      Annotations used
Method                   Synth.   Real        Scans   AUC
Ours (weakly sup.)       3D       2D only     -       0.82
Zhang (2019) [28]        3D       3D          3D      0.82
Boukhayma (2019) [3]     3D       3D          3D      0.76
Iqbal (2018)* [12]       3D       3D          -       0.67
Baek (2019)* [2]         3D       3D          3D      0.61
Zimmermann (2018) [30]   3D       3D          -       0.57
Spurr (2018) [20]        3D       3D          -       0.51
Mueller (2018)* [14]     3D       Unlabeled   -       0.48
Table 8: Datasets used by prior work for evaluation on D+O. With solely fully-supervised synthetic and weakly-supervised real data, we outperform recent works and perform on par with [28]. All other works rely on full supervision from real and synthetic data. *These works report unaligned results.

D+O. In Table 8 we demonstrate the cross-dataset performance on the D+O dataset for R = FH. Most recent works [3, 28, 2] rely on MANO, a low-dimensional embedding acquired from a dataset of highly detailed hand scans. Additionally, some of them require custom synthetic data [2, 3] to acquire their shape parameters. Using only fully supervised synthetic data and weakly-supervised real data in conjunction with BMC, we reach state-of-the-art performance.

5.6 Bootstrapping with Real Data

Figure 5: Bar plot depicting the number of 3D samples required to reach a certain aligned AUC on FH. When few 3D labels are available, we require roughly half the amount of 3D data.

For the case where 3D annotations for real data are available, we study the impact of our biomechanical constraints on reducing the number of labeled samples required. To this end, we train a model in a setting where a fraction of the data carries the full 3D joint annotation and the remainder carries only 2D supervision.
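
A minimal sketch of how such a mixed-supervision split could be set up is given below. The function name, the fixed seed and the use of sample indices are assumptions for illustration; the paper does not specify how the labeled fraction is drawn.

```python
import random


def split_supervision(num_samples, frac_3d, seed=0):
    """Return (ids_3d, ids_2d): sample indices that keep their full 3D labels
    and indices for which only the 2D annotation is used."""
    ids = list(range(num_samples))
    random.Random(seed).shuffle(ids)        # reproducible random assignment
    n_3d = round(frac_3d * num_samples)     # size of the fully labeled subset
    return sorted(ids[:n_3d]), sorted(ids[n_3d:])


# Example: 10% of a hypothetical 1000-sample training set keeps 3D labels.
ids_3d, ids_2d = split_supervision(num_samples=1000, frac_3d=0.1)
```

During training, the 3D loss would be applied only to samples in `ids_3d`, while the 2D loss (and, for the second model, the constraints) would apply to all samples.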

For this section we choose FH, use the entire training set, and evaluate via the online submission system. For each fraction of fully labeled data we evaluate two models. The first is trained on both the fully and weakly labeled samples. The second is trained with the addition of our proposed constraints. We show the results in Fig. 5. For a given AUC, we plot the number of labeled samples required to reach it. We observe that for lower labeling percentages, the amount of labeled data required is approximately halved when including BMC. This showcases the effectiveness of our proposed constraints in low-label settings and supports our initial hypothesis that they decrease the requirement for fully annotated training data.
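
One way to read off "the number of labeled samples required to reach a given AUC" from a set of trained models is linear interpolation along the (samples, AUC) curve. The sketch below shows this under that assumption. The helper name is hypothetical, and the two curves are placeholder numbers for illustration only, not results from the paper.

```python
def samples_needed(curve, target_auc):
    """Linearly interpolate the number of 3D-labeled samples at which an
    increasing (num_samples, auc) curve first reaches target_auc.
    Returns None if the curve never reaches the target."""
    pts = sorted(curve)
    if pts[0][1] >= target_auc:
        return float(pts[0][0])
    for (n0, a0), (n1, a1) in zip(pts, pts[1:]):
        if a0 < target_auc <= a1:
            return n0 + (target_auc - a0) * (n1 - n0) / (a1 - a0)
    return None


# Placeholder curves, NOT measurements: (number of 3D-labeled samples, AUC).
baseline = [(1000, 0.60), (2000, 0.70), (4000, 0.80)]
with_bmc = [(1000, 0.70), (2000, 0.80), (4000, 0.85)]

ratio = samples_needed(with_bmc, 0.70) / samples_needed(baseline, 0.70)
```

With these placeholder curves, `ratio` comes out at about one half, which is the kind of relationship the bar plot in Fig. 5 is meant to convey.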

6 Conclusion

In this paper we propose a set of fully differentiable biomechanical constraints for weakly-supervised training of 3D hand pose estimation networks. Our model consists of a novel procedure to encourage anatomically correct predictions of a backbone network via a set of novel losses that penalize invalid bone lengths and palmar structures, as well as deviations from valid joint angles. We show that the model can take the co-activation of fingers and individual joints into consideration. Furthermore, we have experimentally shown that our constraints leverage weakly-supervised data more effectively, improving both within- and cross-dataset performance. Our method reaches state-of-the-art performance on the aligned Dexter+Object objective using only 2D annotations from real data. Moreover, the method halves the amount of fully annotated training data needed in low-label settings on freiHAND.
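
To make the idea of a loss that penalizes invalid bone lengths concrete, the following sketch uses a hinge-style interval penalty that is zero inside the valid range and grows linearly outside it. This is an illustrative assumption about the form of such a loss, written in plain Python. The function names are hypothetical, and a training implementation would use the clamp operation of a differentiable framework instead of `max`.

```python
def interval_penalty(x, lo, hi):
    """Zero inside [lo, hi], growing linearly with the distance outside it."""
    return max(lo - x, 0.0) + max(x - hi, 0.0)


def bone_length_loss(lengths, limits):
    """Mean interval penalty over bones; `limits` holds one (lo, hi) per bone."""
    total = sum(interval_penalty(x, lo, hi)
                for x, (lo, hi) in zip(lengths, limits))
    return total / len(lengths)


# Toy example: one bone too short, one valid, one too long.
loss = bone_length_loss([0.5, 1.5, 3.0], [(1.0, 2.0)] * 3)
```

Because the penalty vanishes for any prediction inside the valid range, it constrains the output without pulling it towards a particular pose.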



  • [1] Athitsos, V., Sclaroff, S.: Estimating 3D hand pose from a cluttered image. In: CVPR (2003)
  • [2] Baek, S., Kim, K.I., Kim, T.K.: Pushing the envelope for RGB-based dense 3D hand pose estimation via neural rendering. In: CVPR (2019)
  • [3] Boukhayma, A., Bem, R.d., Torr, P.H.: 3D hand shape and pose from images in the wild. In: CVPR (2019)
  • [4] Cai, Y., Ge, L., Cai, J., Yuan, J.: Weakly-supervised 3D hand pose estimation from monocular RGB images. In: ECCV (2018)
  • [5] Chen Chen, F., Appendino, S., Battezzato, A., Favetto, A., Mousavi, M., Pescarmona, F.: Constraint study for a hand exoskeleton: human hand kinematics and dynamics. Journal of Robotics 2013 (2013)
  • [6] Cobos, S., Ferre, M., Uran, M.S., Ortego, J., Pena, C.: Efficient human hand kinematics for manipulation tasks. In: IROS (2008)
  • [7] Ge, L., Ren, Z., Li, Y., Xue, Z., Wang, Y., Cai, J., Yuan, J.: 3D hand shape and pose estimation from a single RGB image. In: CVPR (2019)
  • [8] Hampali, S., Oberweger, M., Rad, M., Lepetit, V.: HO-3D: A multi-user, multi-object dataset for joint 3D hand-object pose estimation. arXiv preprint arXiv:1907.01481 (2019)
  • [9] Hasson, Y., Varol, G., Tzionas, D., Kalevatykh, I., Black, M.J., Laptev, I., Schmid, C.: Learning joint reconstruction of hands and manipulated objects. In: CVPR (2019)
  • [10] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
  • [11] Heap, T., Hogg, D.: Towards 3D hand tracking using a deformable model. In: FG (1996)
  • [12] Iqbal, U., Molchanov, P., Breuel, T., Gall, J., Kautz, J.: Hand pose estimation via latent 2.5D heatmap regression. In: ECCV (2018)
  • [13] Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR (2018)
  • [14] Mueller, F., Bernard, F., Sotnychenko, O., Mehta, D., Sridhar, S., Casas, D., Theobalt, C.: GANerated hands for real-time 3D hand tracking from monocular RGB. In: CVPR (2018)
  • [15] Panteleris, P., Oikonomidis, I., Argyros, A.: Using a single RGB frame for real time 3D hand pose estimation in the wild. In: WACV (2017)
  • [16] Reed, N.: What is the simplest way to compute principal curvature for a mesh triangle? https://computergraphics.stackexchange.com/questions/1718/what-is-the-simplest-way-to-compute-principal-curvature-for-a-mesh-triangle (2019)
  • [17] Romero, J., Kjellström, H., Kragic, D.: Hands in action: real-time 3D reconstruction of hands in interaction with objects. In: ICRA (2010)
  • [18] Romero, J., Tzionas, D., Black, M.J.: Embodied hands: Modeling and capturing hands and bodies together. In: SIGGRAPH-Asia (2017)
  • [19] Ryf, C., Weymann, A.: The neutral zero method—a principle of measuring joint function. Injury 26, 1–11 (1995)
  • [20] Spurr, A., Song, J., Park, S., Hilliges, O.: Cross-modal deep variational hand pose estimation. In: CVPR (2018)
  • [21] Sridhar, S., Mueller, F., Zollhoefer, M., Casas, D., Oulasvirta, A., Theobalt, C.: Real-time joint tracking of a hand manipulating an object from RGB-D input. In: ECCV (2016)
  • [22] Sun, X., Shang, J., Liang, S., Wei, Y.: Compositional human pose regression. In: ICCV (2017)
  • [23] Tekin, B., Bogo, F., Pollefeys, M.: H+O: Unified egocentric recognition of 3D hand-object poses and interactions. In: CVPR (2019)
  • [24] Wu, Y., Lin, J.Y., Huang, T.S.: Capturing natural hand articulation. In: ICCV (2001)
  • [25] Xiang, D., Joo, H., Sheikh, Y.: Monocular total capture: Posing face, body, and hands in the wild. In: CVPR (2019)
  • [26] Yang, L., Yao, A.: Disentangling latent hands for image synthesis and pose estimation. In: CVPR (2019)
  • [27] Zhang, J., Jiao, J., Chen, M., Qu, L., Xu, X., Yang, Q.: 3D hand pose tracking and estimation using stereo matching. arXiv preprint arXiv:1610.07214 (2016)
  • [28] Zhang, X., Li, Q., Mo, H., Zhang, W., Zheng, W.: End-to-end hand mesh recovery from a monocular RGB image. In: ICCV (2019)
  • [29] Zhou, X., Huang, Q., Sun, X., Xue, X., Wei, Y.: Towards 3D human pose estimation in the wild: a weakly-supervised approach. In: ICCV (2017)
  • [30] Zimmermann, C., Brox, T.: Learning to estimate 3D hand pose from single RGB images. In: ICCV (2017)
  • [31] Zimmermann, C., Ceylan, D., Yang, J., Russell, B., Argus, M., Brox, T.: FreiHAND: A dataset for markerless capture of hand pose and shape from single RGB images. In: ICCV (2019)