Accurate hand-pose estimation from monocular depth images is vital for applications such as fine-grained control in human–computer interaction and virtual and augmented reality. However, it is a challenging task due to, e.g., complex poses, self-similarities, and self-occlusions. Many existing methods address these challenges with powerful learning-based tools. Such methods dominate the benchmarks on large public datasets such as NYU and the Hands in the Million Challenge (HIM). Most of these approaches are trained in a fully supervised manner to predict the full set of 21 hand keypoint positions in 3D. However, the current lack of large-scale training datasets that are accurate and diverse causes such methods to overfit, making it difficult to generalize to new settings, or even across benchmarks. Retraining these methods on different data requires the full set of (3D) keypoint annotations, which are tedious to obtain. More importantly, this process is prone to errors in the data annotations, whether due to measurement errors or human errors. Additionally, methods that learn a direct mapping from depth image to keypoints often ignore the inherent geometry of the hand, such as constant bone lengths or joint angle limits. As such, despite their generally good performance, these methods may produce bio-mechanically implausible poses.
An alternative to learning-based approaches are model-based hand tracking methods, such as [15, 27, 32, 35], among others. These methods use generative hand models to recover the pose that best explains the image through an analysis-by-synthesis strategy. While not suffering from anatomical inconsistencies, and generalizing better to yet-unseen scenarios, they require good initialization of the model parameters in order to minimize the non-convex energy function.
Our method addresses the shortcomings of both approaches with a generative model-based loss embedded into a learning-based method. Based on a volumetric Gaussian hand model, this loss incorporates additional annotation-free self-supervision from the depth image. When combined with anatomical priors, this supervision can take the place of the majority of joint annotations for resolving hand pose and bone length ambiguities. In total, our approach reduces the number of required annotations from to , a % decrease. At the same time, the learning-based framework enables accurate and efficient inference during test time without requiring initialization. This effectively combines the main advantages of the two popular categories.
Most existing methods that utilize a model-based loss [13, 14, 38, 43] do not explain the input images in a generative manner. As such, they still require the full set of annotated keypoints per frame. Additionally, due to the reliance on the annotations as the only source of supervision, these methods can overfit to errors and biases in the annotations. We demonstrate that our method can overcome such errors through the use of our additional generative loss.
We summarize our main contributions as follows:
Compared to classical fully supervised methods, our generative loss significantly reduces the amount of annotations needed to accurately infer the full hand pose.
Despite ambiguities resulting from the reduced annotations, our method can simultaneously infer pose and bone lengths at test time.
We provide a new dataset, HandID, which includes fingertip and wrist annotations for users, to address the lack of hand shape variation in existing datasets.
Most importantly, for the first time we demonstrate that such an approach can produce hand pose predictions that better fit to the depth image than the “ground truth” annotations it is trained on.
II Related Work
Existing approaches for hand pose estimation can be broadly categorized into learning-based approaches, model-based approaches, and hybrid approaches.
Discriminative, learning-based approaches. These methods regress the pose parameters directly from image and annotation pairings. Tompson et al. first used a Convolutional Neural Network (CNN) for the task of hand pose estimation. From this foundation, many methods [18, 24] develop novel architectures and training procedures to better model the nonlinear manifold of hand poses. Recent methods investigate the use of different input representations, such as multi-view images, voxels, and point clouds [5, 6, 7], to take advantage of known camera intrinsics.
Generative, model-based approaches. These methods iteratively refine an estimated pose by fitting a 3D hand model to the input depth image. Previous work demonstrated that energies based on articulated, rigid, part-based models of the hand can be optimized to provide good tracking [20, 15]. Additional 3D hand representations, including continuous subdivision surfaces, collections of Gaussians [26, 28], sphere meshes, and articulated signed distance functions, have been proposed with the goal of creating detailed models that are still fast to optimize.
Hybrid approaches. These methods combine learning-based and model-based approaches into one framework to combine the strengths of both. One class of hybrid methods uses learning-based components in a tracking framework to initialize, update, or otherwise guide the tracker’s convergence to the correct pose [19, 23, 27, 29, 30, 12]. These methods are more robust than traditional model-based trackers, but must trade off model and solver efficiency with accuracy during runtime. Another class of hybrid methods uses the learning-based framework and incorporates a model-based loss, usually based on a kinematic skeleton [13, 14, 38, 40, 43]. These methods can better enforce anatomically plausible pose predictions by including pose prior losses in the model space. However, since the model is not generative, they still rely on difficult-to-acquire annotations and overfit to annotation errors if present.
Our proposed hybrid method incorporates a loss that is both generative and model-based, into the learning framework. Unlike other hybrid approaches, the generative model provides supervision from the input depth image. With that, we are able to reduce the requirements on the quantity and accuracy of annotations needed for training, thereby reducing the necessary human effort for data annotation.
Autoencoders are used for obtaining compressed representations from a distribution of inputs. They consist of an encoder that maps the input to a compact code, and a decoder that maps the code back to the (approximate) input. Although the encoder and decoder are usually trained jointly, the encoder can learn to invert a generative model used as the decoder in a self-supervised manner. As a learning objective, the model-based decoder can draw upon the entire training corpus as a regularizer to overcome local minima that arise from noise or ambiguities present in a single image. Tewari et al. use such an autoencoder with a face model to estimate and disentangle face shape, expression, reflectance, and illumination. Recently, such approaches have also been proposed for hand pose estimation in RGB images [2, 3, 8]. These methods have in common that they use geometric cues (e.g. annotated silhouettes and paired depth maps) as supervision for training. Dibra et al. and Wan et al. use autoencoders for inverting a hand model to solve the hand pose estimation problem from depth images without additional cues. In contrast to their unconstrained articulating point cloud, our use of a volumetric Gaussian hand model as a decoder provides a stronger shape prior. This allows our method to solve the much harder problem of combined pose and shape estimation, while their method cannot adapt the hand shape at test time. Although conceptually our method has similarities with the (concurrently developed) work using a spherical representation, our method uses a smooth hand representation. More importantly, we extensively study the effect of a model-based generative loss when training with erroneous annotations (e.g. as present in the HIM dataset), and hence we believe both works can be seen as complementary.
The main idea of our approach is to explain a depth image of a hand based on a generative hand model, cf. Fig. 2. Given a depth image as input, we use a CNN-based encoder to obtain a low-dimensional embedding of the depth image. Our model-based decoder is built upon a parametric hand model that produces a volumetric representation of the hand from a given code vector. Since the code vector from the encoder initializes a parametric model, this enforces a semantically meaningful code vector. By using a suitable representation of the input depth image, we are able to efficiently and analytically compute the overlap between the “rendered” volumetric hand representation generated by the decoder and the input depth image. To be more specific, we approximate the surface of the hand with a collection of 3D Gaussians rigidly attached to a kinematic hand skeleton model. The corresponding Gaussians in image space can be obtained by projecting the 3D Gaussians using the camera intrinsics. Moreover, the depth image is also represented with image-space Gaussians by quadtree-decomposing the image into regions of homogeneous depth and fitting an image Gaussian to each region. The similarity between the model and the image can then be described as the depth-weighted overlap of all pairs of model and image Gaussians. This overlap serves as a generative model-based loss during network training and ensures that the predicted hand faithfully represents the observed data. To enforce plausible poses and bone lengths, we add additional prior losses to avoid inter-penetrations of hand parts, violations of joint limits, and unphysiological combinations of bone lengths. Lastly, supervision for a small subset of keypoints is provided as a way to mitigate the multiple minima present in the non-convex energy. At test time, the trained encoder is able to directly regress the hand pose and bone length parameters.
III-A Hand Model
Kinematic Skeleton. Our kinematic skeleton parameterizes hand shape in terms of bone lengths, and pose as articulation angles with respect to the joint axes. It comprises 20 bones with lengths and 26 degrees of freedom (DOF) (20 angles of articulation and 6 additional DOF for global rotation and translation), see Fig. 3.
To ensure that the predicted bone length vector is plausible, it is parameterized by an affine model constructed using 20 PCA basis vectors. Here, the mean is the average bone length vector, and the basis vectors are the linear PCA basis vectors of the bone length variations scaled by their standard deviations. By scaling the basis vectors, the shape code follows an isotropic standard normal distribution, and deviations along each basis are penalized inversely to how much natural variation exists in that direction. Both the mean and the basis are obtained from bone length vectors computed from 10,000 hand meshes sampled from the linear PCA parameters of the MANO model.
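This affine parameterization can be sketched as follows; all names and concrete numbers here are illustrative, not the paper's actual values:

```python
import numpy as np

def bone_lengths_from_code(beta, mean_lengths, scaled_basis):
    """Map a normalized shape code to a bone-length vector.

    beta         : (k,) code, approximately isotropic standard normal
    mean_lengths : (20,) average bone lengths
    scaled_basis : (20, k) PCA basis columns pre-scaled by their std devs
    """
    return mean_lengths + scaled_basis @ beta

# Toy example with k = 2 basis vectors (the real model uses 20).
mean_lengths = np.full(20, 30.0)   # mm, hypothetical average
basis = np.zeros((20, 2))
basis[0, 0] = 2.0                  # first basis mostly varies bone 0
beta = np.array([1.0, 0.0])        # one std dev along the first direction
b = bone_lengths_from_code(beta, mean_lengths, basis)
assert abs(b[0] - 32.0) < 1e-9 and abs(b[1] - 30.0) < 1e-9
```

Because the basis columns are pre-scaled by their standard deviations, a unit step in any code dimension corresponds to one standard deviation of natural bone-length variation.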
The pose parameter vector controls the angles of articulation with respect to the joint axes in the forward kinematics chain, as well as the global translation and rotation of the entire hand, where the latter is parameterized using Euler angles. Given the bone length parameters and pose, we can obtain the joint positions by applying forward kinematics.
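A minimal planar sketch of such a forward-kinematics chain; the actual model has 26 DOF with 3D joint axes, so this toy version only shows how bone lengths and articulation angles jointly determine joint positions:

```python
import numpy as np

def forward_kinematics(root, bone_lengths, angles):
    """Planar kinematic chain: each joint rotates by its angle relative to
    its parent and extends along a bone of the given length."""
    positions = [np.asarray(root, dtype=float)]
    heading = 0.0
    for length, angle in zip(bone_lengths, angles):
        heading += angle                      # angles accumulate along the chain
        step = length * np.array([np.cos(heading), np.sin(heading)])
        positions.append(positions[-1] + step)
    return np.stack(positions)

# A straight "finger" of three 10 mm bones pointing along x.
pts = forward_kinematics((0.0, 0.0), [10, 10, 10], [0.0, 0.0, 0.0])
assert np.allclose(pts[-1], [30.0, 0.0])
```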
III-B Depth Image Representation
The depth image is represented by a collection of 2D image Gaussian and depth value pairs. Each Gaussian and depth value pair summarizes a roughly homogeneous region with a single depth. To obtain these regions, we use quadtree clustering to recursively divide the image into sub-quadrants until the depth difference within each region is below a threshold (we used mm for our experiments). The Gaussian is chosen such that its mean is the center of the region and its standard deviation is half the side length of the region. The associated depth value is then the average depth value of the quadrant.
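A minimal sketch of this quadtree decomposition; the threshold and data are illustrative, and real inputs need handling of background/invalid depth that is omitted here:

```python
import numpy as np

def quadtree_gaussians(depth, x0, y0, size, thresh=20.0, out=None):
    """Recursively split a square region until its depth range falls below
    `thresh`; emit one (mu, sigma, mean_depth) tuple per homogeneous leaf."""
    if out is None:
        out = []
    patch = depth[y0:y0 + size, x0:x0 + size]
    if size > 1 and patch.max() - patch.min() > thresh:
        half = size // 2
        for dy in (0, half):
            for dx in (0, half):
                quadtree_gaussians(depth, x0 + dx, y0 + dy, half, thresh, out)
    else:
        mu = np.array([x0 + size / 2.0, y0 + size / 2.0])  # region center
        out.append((mu, size / 2.0, float(patch.mean())))  # sigma = half side
    return out

depth = np.zeros((8, 8))
depth[:4, :4] = 100.0   # one quadrant at a different depth
leaves = quadtree_gaussians(depth, 0, 0, 8, thresh=20.0)
assert len(leaves) == 4  # the root splits once into four homogeneous quadrants
```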
III-C Model-based Decoder
To measure the quality of the predicted hand pose and bone length parameters for a given input depth image, we incorporate a decoder layer that “renders” the 3D model representation to a 2.5D representation similar to the image representation. The camera-facing surface of the -th 3D Gaussian is approximated by a projected 2D Gaussian using the intrinsic camera matrix and an associated depth value . For details please refer to the supplemental document.
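One plausible way to sketch this projection, assuming a pinhole camera and isotropic Gaussians; the paper's exact derivation is in their supplemental document, so treat the scaling rule below as an approximation:

```python
import numpy as np

def project_gaussian(mu3d, sigma3d, fx, fy, cx, cy):
    """Approximate the image-plane footprint of an isotropic 3D Gaussian
    under a pinhole camera: project the mean with the intrinsics and scale
    the standard deviation by focal_length / depth."""
    x, y, z = mu3d
    mu2d = np.array([fx * x / z + cx, fy * y / z + cy])
    sigma2d = sigma3d * fx / z   # assumes fx is close to fy
    return mu2d, sigma2d, z      # z serves as the associated depth value

mu2d, s2d, d = project_gaussian((0.0, 0.0, 500.0), 10.0, 475.0, 475.0, 160.0, 120.0)
assert np.allclose(mu2d, [160.0, 120.0]) and abs(s2d - 9.5) < 1e-9
```

A Gaussian on the optical axis projects to the principal point, and its footprint shrinks linearly with distance from the camera, matching the intuition that farther hand parts cover fewer pixels.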
III-D Loss Layer
For training the network, the loss is decomposed into an unsupervised dissimilarity term measuring the discrepancy between depth image and hand model, a collision term to prevent self-intersection, a term regularizing the bone length parameters, a term regularizing the joint angles, and a supervised term explaining the provided joint locations. The relative importance of each term is balanced with scaling factors. With that, the total energy reads
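The weighted combination can be sketched as follows; the term names and weight values are placeholders, not the paper's actual scaling factors:

```python
def total_loss(terms, weights):
    """Weighted sum of the individual energy terms."""
    return sum(weights[name] * value for name, value in terms.items())

terms = {"dissim": 0.5, "collision": 0.1, "bones": 0.2, "limits": 0.0, "joints": 1.0}
weights = {"dissim": 1.0, "collision": 2.0, "bones": 0.5, "limits": 1.0, "joints": 1.0}
assert abs(total_loss(terms, weights) - 1.8) < 1e-9
```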
In the following we describe the individual energy terms.
III-D1 Dissimilarity Measure
To measure the overall similarity between two given (2D Gaussian, depth) tuples, we weight the similarity between the two Gaussians by their distance in depth values. The pairwise similarity between an image Gaussian and a projected model Gaussian is defined using the integral over the product of the two functions. Since in our case the model Gaussian directly depends on the hand pose vector and bone length vector, the similarity is a function of these parameters. Since this integral only measures the 2D overlap of the two Gaussians, we weight it based on the depth difference, where the weighting depends on the standard deviation of the unprojected Gaussian associated with the model component. This decreases the similarity score between two tuples whenever the depth values are far apart, and thereby forces the model to not only match the area of the hand in the depth image, but also the observed depth values.
The overall similarity is defined as the sum over all possible pairings between the model and the image Gaussians, normalized by the self-similarity of the image Gaussians. We use the negated similarity as the loss term, since minimizing the loss then maximizes the similarity.
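A sketch of the normalized, depth-weighted overlap, assuming unnormalized isotropic Gaussians of the form exp(-||x - mu||² / 2σ²) and a Gaussian depth weight; the paper's exact weighting function may differ:

```python
import numpy as np

def gaussian_overlap_2d(mu_p, sig_p, mu_q, sig_q):
    """Closed-form integral over R^2 of the product of two unnormalized
    isotropic 2D Gaussians exp(-||x - mu||^2 / (2 sigma^2))."""
    s2 = sig_p**2 + sig_q**2
    d2 = float(np.sum((np.asarray(mu_p) - np.asarray(mu_q))**2))
    return 2.0 * np.pi * sig_p**2 * sig_q**2 / s2 * np.exp(-d2 / (2.0 * s2))

def similarity(image_gs, model_gs, sigma_z=10.0):
    """Sum of depth-weighted pairwise overlaps, normalized by the
    self-similarity of the image Gaussians (Gaussian depth weight assumed)."""
    def weighted(a_list, b_list):
        total = 0.0
        for (mu_p, s_p, d_p) in a_list:
            for (mu_q, s_q, d_q) in b_list:
                w = np.exp(-(d_p - d_q)**2 / (2.0 * sigma_z**2))
                total += w * gaussian_overlap_2d(mu_p, s_p, mu_q, s_q)
        return total
    return weighted(image_gs, model_gs) / weighted(image_gs, image_gs)

img = [((0.0, 0.0), 5.0, 400.0)]
assert abs(similarity(img, img) - 1.0) < 1e-9  # a perfect match normalizes to 1
```

The loss used in training would then be the negated similarity, so gradient descent pulls the projected model Gaussians toward the image Gaussians in both image position and depth.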
III-D2 Collision Prior
To ensure that the surface represented by the 1σ isosurface of the 3D Gaussians does not (self-)interpenetrate, a repulsive term based on the 3D overlap of the model Gaussians is used. Overloading the notation for the Gaussian overlap (cf. Eq. (4)) to denote the similarity between two different model Gaussian components, we analogously define a penalty so that Gaussians of the model do not overlap in 3D.
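A sketch of such a repulsive term, using the 3D analogue of the Gaussian product integral; the exact functional form in the paper is assumed here:

```python
import numpy as np

def gaussian_overlap_3d(mu_p, sig_p, mu_q, sig_q):
    """Integral over R^3 of the product of two unnormalized isotropic
    3D Gaussians exp(-||x - mu||^2 / (2 sigma^2))."""
    s2 = sig_p**2 + sig_q**2
    d2 = float(np.sum((np.asarray(mu_p, dtype=float) - np.asarray(mu_q, dtype=float))**2))
    return (2.0 * np.pi * sig_p**2 * sig_q**2 / s2)**1.5 * np.exp(-d2 / (2.0 * s2))

def collision_prior(model_gaussians):
    """Repulsive penalty: sum of pairwise 3D overlaps between *different*
    model Gaussians, so interpenetrating hand parts are pushed apart."""
    total = 0.0
    for i, (mu_i, s_i) in enumerate(model_gaussians):
        for mu_j, s_j in model_gaussians[i + 1:]:
            total += gaussian_overlap_3d(mu_i, s_i, mu_j, s_j)
    return total

near = [((0, 0, 0), 5.0), ((1, 0, 0), 5.0)]
far = [((0, 0, 0), 5.0), ((100, 0, 0), 5.0)]
assert collision_prior(near) > collision_prior(far)  # closer parts cost more
```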
III-D3 Bone Length Prior
To keep the bone lengths plausible, we impose a loss which penalizes the deviation of the predicted bone length parameters from the mean parameters. With that, this term helps to keep the predictions in the high-probability region of the normal distribution used in the PCA prior.
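Because the shape code is standard-normal under the PCA parameterization, such a prior reduces to penalizing the code's squared norm; this is a sketch, and any per-dimension weighting in the paper is elided:

```python
import numpy as np

def bone_length_prior(beta):
    """Penalize deviation of the shape code from the mean (zero) shape.
    Since beta is standard-normal under the PCA model, its squared norm
    keeps predictions in the high-probability region."""
    return float(np.sum(np.asarray(beta, dtype=float)**2))

assert bone_length_prior([0.0, 0.0]) == 0.0   # the mean shape costs nothing
assert bone_length_prior([1.0, -2.0]) == 5.0
```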
III-D4 Joint Limits
To keep joint articulations within mechanically and anatomically plausible limits, a joint limit penalty is imposed that penalizes each angle for exceeding its lower and upper limits, which are defined based on anatomical studies of the hand.
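A minimal sketch of a joint-limit penalty; the quadratic form is an assumption, since the paper only specifies that violations of the limits are penalized:

```python
import numpy as np

def joint_limit_penalty(theta, lower, upper):
    """Quadratic penalty on articulation angles that leave their anatomical
    range [lower, upper]; exactly zero inside the range."""
    theta = np.asarray(theta, dtype=float)
    under = np.maximum(np.asarray(lower) - theta, 0.0)
    over = np.maximum(theta - np.asarray(upper), 0.0)
    return float(np.sum(under**2 + over**2))

lo, hi = np.array([-0.5, 0.0]), np.array([0.5, 1.5])
assert joint_limit_penalty([0.0, 1.0], lo, hi) == 0.0   # within limits
assert joint_limit_penalty([0.7, 1.0], lo, hi) > 0.0    # first joint exceeds
```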
III-D5 Joint Location Supervision
We impose an additional supervision loss on a small subset of joint positions in order to help the optimizer converge to a good minimum in the overall generative loss function. We use a combination of 2D and 3D joint location supervision (depending on availability). If full 3D supervision is provided for a given joint, the loss is the distance between the annotation and the corresponding model joint obtained from applying forward kinematics with the model parameters. If only 2D supervision is provided, the loss is the closest distance between the model joint and the camera ray onto which the annotation projects using the camera intrinsics.
Due to inaccuracies in the annotation, the ground truth may conflict with the observed image. Hence, we modify the joint loss to account for annotation uncertainty by introducing a “slack” radius that models the expected uncertainty in millimeters. Predictions within this radius of the ground truth are not penalized, which allows the encoder to be more robust to erroneous annotations. The joint loss for the subset of supervised joints is defined accordingly.
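A sketch of a slack-radius loss for a single joint; penalizing the squared excess distance beyond the radius is an assumption about the exact functional form:

```python
import numpy as np

def slack_joint_loss(pred, gt, radius=10.0):
    """Supervision with a 'slack' radius (in mm): predictions within `radius`
    of the annotation incur zero cost; beyond it, the excess distance is
    penalized."""
    dist = float(np.linalg.norm(np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)))
    return max(0.0, dist - radius)**2

assert slack_joint_loss([0, 0, 5], [0, 0, 0], radius=10.0) == 0.0     # inside slack
assert slack_joint_loss([0, 0, 20], [0, 0, 0], radius=10.0) == 100.0  # (20-10)^2
```

Inside the slack region the gradient is zero, so noisy annotations cannot pull the prediction away from a pose that already explains the depth image.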
We evaluate the impact of our generative model-based loss on pose accuracy and bone length consistency when trained with a reduced set of keypoints. Additionally, we show qualitative results of our predictions and the erroneous “ground truth” on existing datasets to demonstrate the regularizing effect of our loss against annotation errors.
IV-A Architecture and Training
We use ResNet-18 pre-trained on ImageNet as our encoder, as it is fast to use and refine, and achieves good accuracy. The encoder is trained with the Adam optimizer, using a learning rate of and a batch size of . Our pipeline runs in Caffe, where we implemented the decoder and other losses as custom layers. During training, a forward-backward pass with batch size takes ms (for comparison: the ResNet-50 architecture takes ms). A forward pass at test time takes only ms.
IV-B Datasets
We evaluate on two common benchmarks, the NYU Hand Pose dataset and the Hands in the Million Challenge dataset (HIM). We additionally introduce our own HandID dataset for training to address the lack of hand shape variation in the NYU training data.
NYU Hand Pose Dataset. The NYU Hand Pose dataset  is collected using Microsoft Kinect sensors. It contains depth images from a single subject in the training set, and depth images from two subjects in the test set.
Our HandID Dataset. Since the NYU training data only contains a single subject, we introduce additional training data with more hand shape variations to enable our method to learn this variation and hence adapt to different users at test time. We captured a dataset of frames ( x ) from subjects with the Intel SR300 sensor, which we call HandID. A total of pixels that correspond to the fingertips and wrist are annotated per frame. Occluded keypoints were indicated as such. During training, a batch contains examples from both HandID and the NYU dataset with a mixing ratio of .
To emphasize that it is significantly easier to obtain just the fingertips and wrist keypoints, we asked users to annotate all keypoints for a set of depth images. We observed that additional keypoints take longer to annotate (each joint annotation takes times longer) and are less consistent across users (with average distance to mean of pixels vs pixels). In total, the full annotation of joints for images requires minutes, while our subset only needs minutes.
Hands in the Million Challenge (HIM) Dataset. We evaluated our method on the Hands in the Million Challenge (HIM) dataset, where we discovered a systematic error in the “ground truth” annotations. Although the 2D projection of the keypoints into the image plane looks plausible, the 3D keypoint locations do not match the anatomical locations of hand joints (see Fig. 6). To quantify this, we use the minimum-distance-to-point-cloud (MDPC) per joint to approximately measure how well the joint predictions agree with the observed depth image. The NYU annotations and the erroneous HIM annotations have median MDPCs of mm (avg mm) and mm (avg mm), respectively. Assuming that the physical joint is located roughly at the center of the finger, the HIM annotations would imply an implausible finger thickness of mm, while the NYU annotations estimate a more reasonable thickness of mm. We hypothesize that there is a systematic pose-dependent error in corresponding the 3D magnetic sensor positions to the depth camera coordinates (see Fig. 4 of the Supplementary Document). Using our generative model-based loss, we are able to obtain predictions that are significantly more consistent with the observed depth images. The detailed experiment is presented in Section IV-D.
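The MDPC metric can be sketched as a nearest-neighbour distance from each predicted joint to the observed point cloud; the brute-force search below is for illustration, and a real evaluation would use a spatial index:

```python
import numpy as np

def mdpc(joints, point_cloud):
    """Minimum distance to point cloud (MDPC): for each predicted 3D joint,
    the Euclidean distance to its nearest observed 3D point."""
    joints = np.asarray(joints, dtype=float)[:, None, :]       # (J, 1, 3)
    points = np.asarray(point_cloud, dtype=float)[None, :, :]  # (1, N, 3)
    return np.linalg.norm(joints - points, axis=-1).min(axis=-1)

cloud = np.array([[0.0, 0.0, 400.0], [10.0, 0.0, 400.0]])
d = mdpc([[0.0, 0.0, 405.0]], cloud)
assert np.allclose(d, [5.0])
```

A large MDPC means a joint floats far from any observed surface point, which is how the systematic offset in the HIM annotations shows up quantitatively.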
F1 score of k-means clustering of bone length vectors for the two subjects in the test set.
Pre-processing. Similar to established procedures [1, 19], we first localize the hand by using the ground truth joint locations and crop the image to a fixed-size cube with mm side length. Once localized, the image is re-cropped using the same cube, but centered at the average depth. We then scale it to x with a scaled depth range between . During training, in-image-plane translations and rotations, as well as depth augmentations, are applied. This pre-processing step is used for all datasets.
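The depth-normalization part of this pre-processing can be sketched as follows; the cube size, center, and output range are placeholders for the elided values, and the spatial crop/resize is omitted:

```python
import numpy as np

def normalize_depth(depth, center, cube_mm=300.0, out_range=(-1.0, 1.0)):
    """Clamp depths to a metric cube around the hand and rescale them to a
    fixed output range, as done before feeding the crop to the encoder."""
    near, far = center - cube_mm / 2.0, center + cube_mm / 2.0
    clipped = np.clip(depth, near, far)
    lo, hi = out_range
    return (clipped - near) / (far - near) * (hi - lo) + lo

patch = np.array([[250.0, 400.0], [550.0, 1000.0]])
norm = normalize_depth(patch, center=400.0, cube_mm=300.0)
assert norm.min() >= -1.0 and norm.max() <= 1.0
assert norm[0, 1] == 0.0  # the center depth maps to the middle of the range
```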
Model Mismatch. Due to different joint locations in the NYU hand model and ours, only of the commonly evaluated keypoints have a rough equivalence to our model (Fig. 4, left). Hence, we compare our predictions with the state-of-the-art predictions on this subset (Match-11). To better demonstrate that our method can infer the positions of unsupervised keypoints, we evaluate our algorithm for self comparison on an expanded set of NYU keypoints (All-21) which roughly correspond to anatomical joints of our kinematic skeleton (Fig. 4, right). The results are further broken down for the supervised keypoints (Lab-6) and the unsupervised keypoints (Unlab-15).
IV-C Ablation Studies
For the ablation study, we perform quantitative evaluations on the NYU dataset.
Keypoint Accuracy. Removing components from our full method (Full) reduces accuracy. See Table (a) for the average per-joint error in millimeters, and Fig. 4(a) for the percentage of correct frames curve.
Bone Lengths. For bone length evaluation, we cannot directly compare the ground-truth bone lengths to our predicted bone lengths due to the mismatch in model definitions (cf. Fig. 4, left). Instead, we treat the bone lengths of the hand as a 20-dimensional vector and use k-means clustering with two clusters to separate the bone length vectors of the two subjects in the test set of the NYU dataset. In Table (c), we show the F1 scores (the harmonic mean of precision and recall) of the two clusters. k-means is meaningful for this task, as clustering the bone lengths of the annotations (Ground Truth) results in perfect F1 scores for both subjects. Note that poses with high self-occlusion result in depth images with very little information to help disambiguate hand shapes. Thus, one cannot expect methods that perform per-frame estimation to attain a perfect F1 score from the given supervision.
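This evaluation can be sketched with a minimal 2-means clustering on synthetic bone-length vectors; the data and the deterministic initialization are illustrative only:

```python
import numpy as np

def two_means(X, iters=20):
    """Minimal k-means with k = 2, initialized with the first sample and the
    sample farthest from it."""
    far = int(np.argmax(np.linalg.norm(X - X[0], axis=1)))
    centers = np.stack([X[0], X[far]]).astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=-1), axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def f1(labels, truth):
    """F1 score (harmonic mean of precision and recall) of cluster 1
    against subject 1."""
    tp = np.sum((labels == 1) & (truth == 1))
    prec = tp / max(np.sum(labels == 1), 1)
    rec = tp / max(np.sum(truth == 1), 1)
    return 2 * prec * rec / max(prec + rec, 1e-9)

# Two synthetic "subjects" with clearly different bone lengths.
X = np.vstack([np.full((5, 20), 28.0), np.full((5, 20), 34.0)])
X += np.random.default_rng(1).normal(0, 0.2, X.shape)
truth = np.array([0] * 5 + [1] * 5)
labels = two_means(X)
assert max(f1(labels, truth), f1(1 - labels, truth)) > 0.99
```

Because k-means labels are arbitrary, the score is taken over both label assignments, mirroring how the per-subject F1 in Table (c) is agnostic to cluster naming.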
Discussion. Given the reduced supervision, it is ambiguous whether the loss is minimized by deforming the bone lengths or updating the hand pose. Consequently, the method without bone length prior can arbitrarily distort the bone lengths as long as the fingertips are correctly estimated (w/o Prior, see Table (a)). This results in a significant drop in accuracy for keypoints without direct supervision (Unlab-15). Correspondingly, k-means clustering fails to find consistent clusters for the two subjects.
However, the bone length prior alone is not enough to resolve the ambiguity in hand shape. A similar drop in accuracy on unsupervised keypoints (Unlab-15) occurs when the dissimilarity loss is removed (w/o Dissim., see Table (a)). This is because statistically plausible bone lengths can still vary wildly to accommodate the fingertip annotations, without being constrained to explain the image. Pose priors in the form of joint limits (w/o Lim.) and collision prior (w/o Collision) additionally constrain the articulations, which improve the keypoint accuracy.
Due to the NYU training data containing only one hand shape, it is sufficient for the method to consistently regress this particular set of bone lengths when HandID is not present (w/o HandID, see Table (a)). As a result, the method cannot learn to discriminate between hand shapes of different users, leading to F1 scores that are close to random. Hence, for the unseen hand shape in the test set, the method cannot minimize the joint loss (see Eq. (10)) of the supervised keypoints, which leads to greatly reduced accuracy on supervised keypoints (Lab-6). This mode of failure can be accounted for if hand shape variations are present in the training data. The result of this can be seen in our full method (Full, see Table (a)).
IV-D Comparison to the State of the Art (SotA)
Although state-of-the-art methods obtain mean per-joint errors lower than mm (e.g. [6, 39]) on the HIM dataset, we emphasize that this is against the erroneous “ground truth”. We train our method using a “slack” radius of mm to account for the error and show better fitting pose predictions than even the “ground truth” (see Fig. 6 and Fig. 4 of Supplemental Material for more qualitative evaluation).
For a fairer quantitative evaluation, we instead use minimum-distance-to-point-cloud (MDPC) to approximate how well the predictions fit the input. On the HIM test set of comprising images, our method achieves median MDPCs of mm (avg mm), while achieves mm (avg mm). Our predictions better match the NYU annotations with median MDPCs of mm (avg mm). This suggests that our method better fits the observed input while most state-of-the-art methods learn to replicate the errors in the training data.
We further show that the dissimilarity loss helps to overcome annotation errors by testing the method trained on HIM data on the NYU data (See Fig. 4(c)). Without the dissimilarity loss, the method performs significantly worse.
On the NYU dataset (see Table (b) and Fig. 4(b)), our method outperforms the other kinematic model-based methods of Zhou et al. and Malik et al. while requiring fewer keypoint annotations. Although methods that directly predict 3D joint positions perform better [1, 6, 17], we emphasize that these methods, lacking a model-based generative loss, are liable to learn the annotation errors as shown.
We compare our method to Dibra et al.  and Wan et al. . Although we were unable to obtain their predictions on the subset of Match-11 keypoints, we note that Dibra et al.  have a similar “uncorrected” percentage of correct frames curve on all keypoints to Zhou et al. , which we greatly outperform, and we achieve similar performance to Wan et al. ’s method with single view training.
While their methods do not require any annotation, our method additionally solves the more ambiguous and harder problem of adapting to the hand shapes of the user during test time, while their methods can only fit to the average hand shape of the training data or to preset bone lengths.
IV-E Adaptation to a New Domain
Despite the aforementioned annotation errors, the HIM dataset contains a large variety of views, poses, and hand shapes that could be used to supplement the NYU training data to help improve generalization. We show that our method can still benefit from data with erroneous annotations (see Table (b) and Fig. 4(b)). We trained our method by mixing the NYU, HIM, and HandID datasets in a single batch with a ratio of ::. When HIM data is used without the dissimilarity loss (Full + HIM w/o Dissim.), the annotation errors cause the overall performance to degrade. With our dissimilarity loss enabled (Full + HIM), the self-supervision ignores the annotation errors and improves the results.
V Limitations & Discussion
Although our method outperforms other kinematic model-based methods, even with fewer annotations, there is still a gap to recent learning-based methods that regress 3D joint positions. However, these methods (1) are not explicitly penalized for producing anatomically implausible shapes due to the lack of an underlying kinematic hand model, and (2) are prone to overfitting to errors in the training annotations, as well as to errors in the annotation collection method.
Additionally, for poses with heavy self-occlusions, the monocular depth data is not sufficient to resolve ambiguities with the reduced annotation set used by our method. Extra supervision, such as from temporal consistency, or from multi-view constraints (as done in ), is needed to estimate the pose and shape in these cases.
We have shown that a generative model-based loss can reduce the amount of supervision needed to learn both the pose and shape of hands. This greatly reduces the amount of annotations needed to adapt a method to data obtained in a new domain. Furthermore, we show that the generative model-based loss helps to regularize against annotation errors, for example on the HIM dataset, while existing methods overfit to these errors. This demonstrates the importance of ensuring that the model predictions explain not only the annotations but also the image itself.
- (2018) Augmented skeleton space transfer for depth-based hand pose estimation. In CVPR.
- (2019) Pushing the envelope for RGB-based dense 3D hand pose estimation via neural rendering. In CVPR.
- (2019) 3D hand shape and pose from images in the wild. In CVPR.
- (2017) How to refine 3D hand pose estimation from unlabelled depth data? In 3DV, pp. 135–144.
- (2016) Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. In CVPR, pp. 3593–3601.
- (2017) 3D convolutional neural networks for efficient and robust hand pose estimation from single depth images. In CVPR, pp. 5679–5688.
- (2018) Point-to-point regression PointNet for 3D hand pose estimation. In ECCV.
- (2019) 3D hand shape and pose estimation from a single RGB image. In CVPR.
- (2016) Deep residual learning for image recognition. In CVPR.
- (2014) Caffe: convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.
- (2015) Adam: a method for stochastic optimization. In ICLR.
- (2018) Top-down model fitting for hand pose recovery in sequences of depth images. Image and Vision Computing 79.
- (2017) Simultaneous hand pose and skeleton bone-lengths estimation from a single depth image. In 3DV, pp. 557–565.
- (2018) Structure-aware 3D hand pose regression from a single depth image. In EuroVR, pp. 3–17.
- (2013) Dynamics based 3D skeletal hand tracking. In GI '13, pp. 63–70.
- (2008) Analysis-by-synthesis by learning to invert generative black boxes. In ICANN.
- (2017) DeepPrior++: improving fast and accurate 3D hand pose estimation. In ICCVW, pp. 585–594.
- Hands deep in deep learning for hand pose estimation. In CVWW, pp. 1–10.
- (2015) Training a feedback loop for hand pose estimation. In ICCV, pp. 3316–3324.
- (2011) Efficient model-based 3D tracking of hand articulations using Kinect. In BMVC, pp. 101.1–101.11.
- (2017) Embodied hands: modeling and capturing hands and bodies together. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 36(6), pp. 245:1–245:17.
- Kinematic model of the hand using computer vision. PhD thesis.
- (2015) Accurate, robust, and flexible real-time hand tracking. In CHI '15, pp. 3633–3642.
- (2016) . In CVPR, pp. 4150–4158.
- (2018) FingerInput: capturing expressive single-hand thumb-to-finger microgestures. In ISS, pp. 177–187.
- (2014) Real-time hand tracking using a sum of anisotropic Gaussians model. In 3DV.
- (2015) Fast and robust hand tracking using detection-guided optimization. In CVPR.
- (2011) Fast articulated motion tracking using a sums of Gaussians body model. In ICCV, pp. 951–958.
- (2015) Cascaded hand pose regression. In CVPR, pp. 824–832.
- (2015) Opening the black box: hierarchical sampling optimization for estimating human hand pose. In ICCV, pp. 3325–3333.
- (2016) Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences. ACM ToG 35.
- (2017) Articulated distance fields for ultra-fast tracking of hands interacting. ACM Trans. Graph. 36(6), pp. 244:1–244:12.
- (2017) MoFA: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In ICCV.
- (2016) Sphere-meshes for real-time hand modeling and tracking. ACM ToG 35, pp. 1–11.
- (2017) Online generative model personalization for hand tracking. ACM Trans. Graph. 36(6), pp. 243:1–243:11.
- (2014) Real-time continuous pose recovery of human hands using convolutional networks. ACM ToG 33, pp. 169:1–169:10.
- (2019) Self-supervised 3D hand pose estimation through training by fitting. In CVPR.
- (2018) Model-based hand pose estimation for generalized hand shape with appearance normalization. arXiv.
- (2018) HandMap: robust hand pose estimation via intermediate dense guidance map supervision. In ECCV.
- (2016) Spatial attention deep net with partial PSO for hierarchical hybrid hand pose estimation. In ECCV.
- (2017) BigHand2.2M benchmark: hand pose dataset and state of the art analysis. In CVPR.
- (2018) Depth-based 3D hand pose estimation: from current achievements to future goals. In CVPR.
- (2016) Model-based deep hand pose estimation. In IJCAI, pp. 2421–2427.