1 Introduction
Human hand pose estimation and reconstruction in 3D is a long-standing problem in the computer vision and graphics communities, with applications in domains such as virtual and augmented reality and human-machine interaction [35, 15, 46, 13]. With the abundance of affordable commodity depth cameras, the research literature naturally focused more on estimating 3D hand pose from depth observations (e.g. [62, 66, 10, 36, 61]), and many works also explored this problem in multiview setups [33, 65, 41, 8, 31, 50]. With a monocular color input, the problem becomes inherently ill-posed due to increased depth and scale ambiguities, but that did not prevent several researchers [4, 9, 51, 57, 63, 39] from attempting to solve it in the past, albeit with limited results. More recently, the unprecedented success of deep learning on similar tasks motivated new work with encouraging results for 3D hand pose from single images [68, 27, 7, 47, 14]. Nevertheless, this task remains particularly difficult: unlike clothed human bodies or faces, hands have an almost uniform appearance and lack characteristic local features such as the eyes and mouth of a face. Unlike bodies, they can have more complex pose configurations and can be captured from a much wider range of views. Furthermore, when observed in the wild, as in the Mpii+Nzsl dataset [44] (Figure 9), their images usually contain external occlusion, self-occlusion, clutter and motion blur. Besides, hands are often small compared to the scene, so cropped patches around them have low resolution.
The main obstacles for 3D hand pose estimation from images with deep learning include: (i) the lack of large datasets annotated with reliable 3D ground truth and (ii) the inability of current 3D annotated datasets to make networks generalize well to challenging images in the wild.
The first point is tackled in the literature by training with synthetic images [68], populating datasets by transforming synthetic images into real-looking ones [27], or leveraging auxiliary types of data in training such as depth [7, 47]. We propose a different, simple yet efficient approach to alleviate both challenges (i) and (ii) by circumventing heavy dependence on 3D data in training: instead of relying on images paired with 3D joint annotations to learn a prior on hand geometry, we exploit a recently proposed differentiable articulated mesh deformation hand model [40] built with linear blend skinning [18], and we reformulate the prediction problem as learning-based model fitting that can be trained using both 3D and 2D joint annotations. Training with 2D annotated images gives access to larger datasets (e.g. Panoptic [44]) with a fair share of annotated images in the wild (e.g. Mpii+Nzsl [44]) compared to datasets with 3D ground truth, thus helping improve generalization to this type of challenging data. Given an input image, and optionally 2D joint detections obtained from an independent CNN, a deep convolutional encoder predicts the hand shape and pose parameters and the view parameters. The model-based decoder uses these parameters to generate a 3D triangulated hand mesh and its underlying skeleton, along with their reprojection in the image domain (see Figure 1).
Our contributions in this paper are as follows. This work is the first to propose end-to-end learning of both 3D hand shape and pose from a single RGB image. We also show for the first time that the prior knowledge of factored hand shape and pose in a precomputed linear blend skinning [18] hand model [40], combined with a deep convolutional encoder, yields state-of-the-art performance in 3D pose prediction from images and produces geometrically valid and plausible 3D reconstructions without the need for post-processing optimization [27]. We show that this strategy, combined with training on 2D annotated datasets of images in the wild, produces good generalization in 3D hand reconstruction for challenging images in uncontrolled environments.
We evaluate our work both quantitatively, in terms of 3D pose estimation, and qualitatively, using various public datasets. These evaluation sets account for cases of hand interaction with objects, occlusion and clutter, and contain egocentric-view images, third-person-view images, and images in the wild. Our method obtains state-of-the-art results on standard benchmarks, even compared to methods using additional depth information in training [7, 47], camera intrinsics [27, 34], and post-processing optimization [27]. Our method shows superior qualitative results on a challenging dataset of images in the wild (Figure 9 and supplementary material).
2 Related work
There is a rich literature on 3D hand pose and reconstruction from depth [62, 66, 10, 36, 61, 11, 43, 45, 19, 20, 24, 30, 37, 48, 52, 53, 59, 64], image and depth [26, 32, 49, 28], stereo [33, 65, 41] and multiple images [8, 31, 50]. Here we focus on work that considers solely a single color input image.
3D hand pose from a single image
Pre-deep learning
There have been attempts to solve 3D hand pose estimation from a monocular color input prior to deep learning with both discriminative and generative approaches [4, 9, 51, 57, 63, 39]. However, most of these methods have limited performance and depend on various requirements such as careful initialization and prior knowledge of the background.
Deep learning
The work of [68] was the first to propose 3D hand pose estimation from single images using deep learning. Their method chains three networks that, in sequence, segment the hand, predict 2D joints, and then predict 3D joints. The work of [27] shows that this method generalizes poorly to real-world images, since a major part of its training data is synthetic. In turn, the authors of [27] propose to use CycleGAN [67] to transform synthetic 3D annotated images of hands into real-looking ones. The resulting data is used to train a regressor that predicts 2D and 3D hand joints. A final optimization step fits a 3D skeleton to these 2D and 3D predictions using the camera intrinsics. The method in [34] consists of an optimization that fits a hand model to 2D joint detections obtained from a state-of-the-art CNN [44]. We also use a predefined hand model [40], but within a pipeline trained end-to-end.
Depth regularization
Recent works tackle depth ambiguity in 3D hand pose prediction from images by leveraging depth maps in training. [7] proposes to reduce the dependency on noisy 3D annotations in real datasets by introducing a network that predicts full depth maps from the 3D joints. This depth regularizer is trained with ground-truth depth data for both real and synthetic training images, while the 3D predictions are only supervised by the reliable synthetic labels. The authors of [47] use multiple variational autoencoders sharing the same latent space, each autoencoding a separate hand data modality (e.g. images, 2D joints, 3D joints). They show that the auxiliary autoencoders help regularize the latent space and produce improved cross-modal predictions (e.g. image to 3D joints). [14] shows that predicting an implicit 2.5D heatmap representation yields improved 3D predictions even without explicit full depth-map supervision.
Hand models
Many hand models have been proposed in the literature, primarily for tracking from depth and color data, where the hand is modelled using various techniques such as assembled geometric primitives [32], sum of Gaussians [50], sphere meshes [58] or loop subdivision of a control mesh [20]. In order to better capture the shape of the hand, [32] defines scaling terms to allow bone length to vary, while [54] pre-calibrates the shape to fit the hand of interest. The work in [20] was the first to learn hand shape variation from scans with linear blend skinning [18]. The model proposed recently in [40], referred to as MANO, improves on the latter by learning pose-dependent corrective blend shapes [25], thus modelling both hand shape and pose and generating more realistic posed meshes. We use the MANO [40] model in this work.
Model-based decoders
Several works propose to combine deep convolutional encoders with generative models as decoders for human face [56, 55] and body [17, 60] 3D reconstruction. In many of these works, the decoder is a combination of a parametric model (e.g. linear face model [6], SMPL [25]) and a reprojection/rendering module. While most works fix these decoders, some propose to tune them after a supervised initialization [2, 22, 55]. This is the first work to propose a combination of a CNN encoder with a fixed generative hand model [40] for the problem of 3D hand reconstruction from images.
3 Overview
As illustrated in Figure 1, our pipeline takes as input an image of a hand and optionally 2D joint heatmaps from an independent hand detector. A deep convolutional encoder processes the input and generates a set of hand shape and pose parameters $\beta$ and $\theta$, and a set of view parameters $R$, $t$ and $s$ (rotation, translation and scale). The hand parameters are fed to a differentiable articulated mesh deformation hand model that generates a triangulated 3D mesh and its underlying 3D skeleton. These outputs are then reprojected into the image domain through a weak perspective camera model controlled by the view parameters. The reprojection module and the hand model together form a model-based decoder whose parameters are fixed and do not require training. The encoder is pretrained with synthetic examples that we created, as elaborated in Section 6. We note that the training of our pipeline is done end-to-end using 2D and 3D joint annotations without supervision on the hand and view parameters, except for a regularization on the hand parameters to ensure that their magnitude is small. We detail the functioning of the various parts of the pipeline in the following sections.
4 Hand model
We use the MANO hand model [40], which is based on the SMPL model for human bodies [25]. It is an articulated mesh deformation model represented by a differentiable function $M(\beta, \theta)$ taking as input two sets of parameters $\beta$ and $\theta$ that control the shape and pose of the generated hand respectively:
$$M(\beta, \theta) = W(T(\beta, \theta), J(\beta), \theta, \omega) \tag{1}$$
where $W$ is a linear blend skinning [18] function applied to a template hand triangulated mesh $T(\beta, \theta)$ rigged with a kinematic tree of joints. $J(\beta)$ represents the joint locations, learned as a sparse linear regressor from mesh vertices, and $\omega$ are the blend weights.
In order to reduce the artifacts of linear blend skinning, such as overly smooth outputs and mesh collapse around joints, the hand template $T(\beta, \theta)$ is obtained by deforming a mean mesh $\bar{T}$ with both shape and pose corrective blend shapes, $S_n$ and $P_n$ respectively, as follows:
$$T(\beta, \theta) = \bar{T} + \sum_n \beta_n S_n + \sum_n \left(R_n(\theta) - R_n(\theta^*)\right) P_n \tag{2}$$
where $R_n(\theta)$ is the $n$-th element of a vector concatenating rotation matrix coefficients from all joints for pose $\theta$, and $\theta^*$ is the rest pose. The model constants are learned using registered hand scans from multiple subjects performing a range of hand poses.
In the SMPL model, the pose vector stacks the axis-angle values of the joints. To help the hand model generate physically plausible poses, the authors of [40] reduce this pose representation to a linear embedding by performing Principal Component Analysis (PCA) on the axis-angle values of the joints in the data collected to build the model. The pose vector $\theta$ then contains the leading PCA coefficients instead of the axis-angle values, and the shape is likewise represented by a vector of coefficients $\beta$.
Given input shape and pose parameters, we obtain a hand mesh $M(\beta, \theta)$ of 778 vertices and 1538 faces, along with the corresponding 3D joints $J(\beta, \theta)$, obtained by applying to $J(\beta)$ the global rigid transformations induced by pose $\theta$. As the hand skeleton in MANO does not contain fingertip joints, we append to $J(\beta, \theta)$ 5 vertices from the hand mesh that correspond to these keypoints. The final 3D joint output counts 21 keypoints.
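The skinning step at the core of this hand model can be illustrated with a minimal pure-Python sketch. It is not the MANO implementation: the function names, the toy weights and the two-joint setup are assumptions made for illustration, and the per-joint transforms are assumed to be already composed along the kinematic tree.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def rigid_transform(rotation: Mat3, translation: Vec3, point: Vec3) -> List[float]:
    """Return rotation @ point + translation for one 3D point."""
    return [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]


def linear_blend_skinning(
    vertices: Sequence[Vec3],
    weights: Sequence[Sequence[float]],
    rotations: Sequence[Mat3],
    translations: Sequence[Vec3],
) -> List[List[float]]:
    """Pose rest-pose vertices by blending per-joint rigid transforms.

    weights[i][k] is the influence of joint k on vertex i (rows sum to 1).
    rotations[k] and translations[k] give the world transform of joint k.
    """
    posed = []
    for vertex, vertex_weights in zip(vertices, weights):
        blended = [0.0, 0.0, 0.0]
        for w, rotation, translation in zip(vertex_weights, rotations, translations):
            moved = rigid_transform(rotation, translation, vertex)
            blended = [b + w * m for b, m in zip(blended, moved)]
        posed.append(blended)
    return posed


# Toy example (hypothetical values): one vertex influenced equally by a
# fixed joint and by a joint rotated 90 degrees about the z-axis.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
posed = linear_blend_skinning(
    vertices=[[1.0, 0.0, 0.0]],
    weights=[[0.5, 0.5]],
    rotations=[IDENTITY, ROT_Z_90],
    translations=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
)
```

In this toy case the blended vertex ends up at (0.5, 0.5, 0), closer to the joint than either of the two rigidly transformed copies. This shrinkage is the kind of collapse around joints that corrective blend shapes are meant to reduce.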
5 Camera model
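In order to reproject the 3D hand mesh vertices and 3D joints into the 2D image plane, we use the weak perspective model. This approximation allows us to train with annotated images even in the absence of camera intrinsics, which is the case for images in the wild obtained from YouTube videos, for instance (e.g. the Mpii+Nzsl dataset). Given a global rotation matrix $R$, a translation $t$ and a scaling $s$, the projection writes:
$$v = s\,\Pi(R\,M(\beta, \theta)) + t \tag{3}$$
$$x = s\,\Pi(R\,J(\beta, \theta)) + t \tag{4}$$
where $v$ and $x$ are the reprojected mesh vertices and joints, $M(\beta, \theta)$ and $J(\beta, \theta)$ are the 3D mesh vertices and joints produced by the hand model, and $\Pi$ is the orthographic projection.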
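A weak perspective projection of this kind can be sketched as follows. This is an illustrative pure-Python version under the assumption that the projection rotates each point, drops its depth coordinate, then applies a uniform scale and a 2D translation; the function name and argument layout are hypothetical.

```python
from typing import List, Sequence


def weak_perspective_project(
    points: Sequence[Sequence[float]],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
    scale: float,
) -> List[List[float]]:
    """Project 3D points to 2D with a weak perspective camera.

    rotation is a 3x3 matrix, translation a 2D offset in the image plane,
    and scale a single positive factor shared by all points.
    """
    projected = []
    for point in points:
        rotated = [
            sum(rotation[r][c] * point[c] for c in range(3)) for r in range(3)
        ]
        # Orthographic projection: keep x and y, drop the depth (z).
        projected.append(
            [scale * rotated[0] + translation[0], scale * rotated[1] + translation[1]]
        )
    return projected


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixels = weak_perspective_project(
    points=[[1.0, 2.0, 5.0], [1.0, 2.0, -7.0]],
    rotation=IDENTITY,
    translation=[10.0, 20.0],
    scale=2.0,
)
```

Both points land on the same pixel although their depths differ, which shows why no focal length or principal point is needed: depth only influences the result through the single global scale.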
6 Encoder
Given an input hand image, the goal of the encoder is to predict the corresponding hand pose and shape parameters $\theta$, $\beta$ and camera parameters $R$, $t$, $s$. We use the ResNet50 network [12] and adjust the final fully connected layer to output a single vector concatenating these parameters. We note that the global rotation is encoded with axis-angle values and is hence represented with 3 parameters. We also experiment with feeding 2D hand joint heatmaps, obtained with a state-of-the-art method [44], as additional input channels alongside the hand RGB image.
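For reference, an axis-angle vector can be converted to a rotation matrix with Rodrigues' formula. The sketch below is a generic pure-Python version of that standard formula, not code taken from the pipeline described here; the function name is hypothetical.

```python
import math
from typing import List, Sequence


def axis_angle_to_matrix(axis_angle: Sequence[float]) -> List[List[float]]:
    """Convert a 3-vector (rotation axis scaled by angle in radians) to a 3x3 matrix."""
    angle = math.sqrt(sum(c * c for c in axis_angle))
    if angle < 1e-12:
        # A near-zero vector encodes the identity rotation.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = (c / angle for c in axis_angle)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    k = 1.0 - cos_a
    # Rodrigues' formula: R = cos(a) I + sin(a) [u]_x + (1 - cos(a)) u u^T
    return [
        [cos_a + x * x * k, x * y * k - z * sin_a, x * z * k + y * sin_a],
        [y * x * k + z * sin_a, cos_a + y * y * k, y * z * k - x * sin_a],
        [z * x * k - y * sin_a, z * y * k + x * sin_a, cos_a + z * z * k],
    ]


# A quarter turn about the z-axis maps the x-axis onto the y-axis.
quarter_turn = axis_angle_to_matrix([0.0, 0.0, math.pi / 2])
```

The three-number encoding keeps the encoder output compact and always maps to a valid rotation, which a directly regressed 3x3 matrix would not guarantee.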
Network pretraining
We pretrain the encoder to ensure that the camera and hand parameters converge towards acceptable values. For this purpose, we create a synthetic dataset of hand images paired with their ground-truth camera and hand parameters, using the same generative model that we use as a decoder. Hand geometries are obtained by sampling poses $\theta$ and shapes $\beta$, then applying rotations $R$, translations $t$ and scalings $s$. Although the work of [40] does not model hand appearance, the authors provide the scans used to build the geometry model together with their registered counterparts. The original scans come with 3D coordinates and RGB values for each vertex. We create example hand appearances using the registered scan topology: to each vertex in a registered mesh, we assign the RGB value of the closest vertex in the corresponding original scan, and we interpolate these values inside faces. The textured hands are finally rendered on top of random background images. Figure 2 shows examples from the resulting dataset.
7 Training objective
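We combine multiple losses to train our pipeline: a 2D joint reprojection loss $E_{2D}$, a 3D joint loss $E_{3D}$, a hand mask loss $E_{mask}$ and a model parameter regularization loss $E_{reg}$.
$$E = \lambda_{2D} E_{2D} + \lambda_{3D} E_{3D} + \lambda_{mask} E_{mask} + E_{reg} \tag{5}$$
where $\lambda_{2D}$, $\lambda_{3D}$ and $\lambda_{mask}$ are weighting factors.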
2D joint reprojection loss
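This loss ensures that the reprojected hand joints in the image plane coincide with the ground-truth 2D hand joint annotations:
$$E_{2D} = \lVert s\,\Pi(R\,J(\beta, \theta)) + t - x_{2D} \rVert_1 \tag{6}$$
where $s\,\Pi(R\,J(\beta, \theta)) + t$ are the reprojected hand joints and $x_{2D}$ is a vector containing the ground-truth 2D hand joint coordinates. We use the $L_1$ loss to account for inaccuracies in hand annotations in our training datasets.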
3D joint loss
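When ground-truth 3D hand joint annotations are available (e.g. the Stereo dataset), this loss minimises the distance between the latter and the 3D hand joints generated by the hand model:
$$E_{3D} = \lVert R\,J(\beta, \theta) - x_{3D} \rVert_2^2 \tag{7}$$
where $R\,J(\beta, \theta)$ are the rotated 3D joints generated by the hand model and $x_{3D}$ is a vector containing the ground-truth 3D hand joint coordinates.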
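The two joint losses can be sketched as below. This is a hedged illustration: it assumes an L1 penalty on 2D joints and a squared L2 penalty on 3D joints, both summed over joints, and whether the original losses are summed or averaged is not stated in the text. The function names are hypothetical.

```python
from typing import Sequence

Joints = Sequence[Sequence[float]]


def joint_loss_l1(predicted: Joints, target: Joints) -> float:
    """Sum of absolute coordinate differences over all joints (assumed 2D loss)."""
    return sum(
        abs(p - g)
        for pred_joint, gt_joint in zip(predicted, target)
        for p, g in zip(pred_joint, gt_joint)
    )


def joint_loss_squared_l2(predicted: Joints, target: Joints) -> float:
    """Sum of squared coordinate differences over all joints (assumed 3D loss)."""
    return sum(
        (p - g) ** 2
        for pred_joint, gt_joint in zip(predicted, target)
        for p, g in zip(pred_joint, gt_joint)
    )


loss_2d = joint_loss_l1([[0.0, 0.0], [1.0, 1.0]], [[1.0, 0.0], [1.0, 3.0]])
loss_3d = joint_loss_squared_l2([[0.0, 0.0, 0.0]], [[1.0, 2.0, 2.0]])
```

An absolute-value penalty grows linearly with the error, so a few badly placed 2D annotations pull on the solution less than they would under a squared penalty.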
Hand mask loss
We introduce this novel loss to help speed up the convergence of our training and refine hand shape predictions. This loss penalizes reprojected hand vertices that lie outside of the hand region in a binary mask, which is precomputed prior to training:
$$E_{mask} = 1 - \frac{1}{N} \sum_{i=1}^{N} H(x_i) \tag{8}$$
where $x_i$ is the reprojection of the $i$-th of the $N$ mesh vertices and $H$ is an occlusion-aware hand mask, i.e. $H(x) = 1$ if pixel $x$ is inside the hand region even if the hand is occluded in the image, and $H(x) = 0$ otherwise. Notice that these masks cannot be obtained with hand skin segmentation methods (e.g. [23, 5]), as they are sensitive to occlusions.
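A mask loss of this kind can be sketched as one minus the fraction of reprojected vertices that fall inside the mask. The normalization, the nearest-pixel lookup and the function name below are assumptions for illustration. A hard lookup like this is not differentiable, so an actual training implementation would need a differentiable sampling of the mask.

```python
from typing import Sequence


def hand_mask_loss(
    projected_vertices: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
) -> float:
    """Return 1 - (fraction of projected vertices lying on mask pixels equal to 1).

    projected_vertices holds (x, y) image coordinates; mask is indexed as
    mask[row][col], i.e. mask[y][x]. Vertices outside the image count as outside.
    """
    height, width = len(mask), len(mask[0])
    inside = 0
    for x, y in projected_vertices:
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width and mask[row][col] == 1:
            inside += 1
    return 1.0 - inside / len(projected_vertices)


# Hypothetical 3x3 mask whose only hand pixel is the centre.
toy_mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
toy_vertices = [(1.0, 1.0), (0.0, 0.0), (1.2, 0.9), (5.0, 5.0)]
loss_mask = hand_mask_loss(toy_vertices, toy_mask)
```

Two of the four toy vertices fall on the hand pixel, so half of the vertices are penalized.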
We obtain an approximation of these masks (Figure 3) for training images using the GrabCut [42] algorithm, by initializing the foreground, background and probable foreground/background regions using the 2D hand joint annotations. As illustrated in Figure 3(b), we create an initial foreground by drawing lines of fixed pixel width connecting joints according to the hand skeleton hierarchy. Pixels inside triangles formed by joints that belong anatomically to the hand surface are appended to the foreground as well. The undecided area is defined as the region within a fixed pixel distance of the foreground, and the remaining pixels are assigned to the initial background.
Regularization loss
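This loss acts on the hand model parameters at the encoder output, reducing their magnitude to obtain physically plausible hand reconstructions with reduced mesh distortions:
$$E_{reg} = \lVert \theta \rVert^2 + \lambda_{\beta} \lVert \beta \rVert^2 \tag{9}$$
where $\theta$ and $\beta$ are the predicted pose and shape parameters and $\lambda_{\beta}$ is a weighting factor.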
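The regularizer and the way the individual terms could be combined into one objective are sketched below. The squared-norm form, the placement of the single weighting factor on the shape term, and all numeric values are assumptions for illustration only.

```python
from typing import Sequence


def regularization_loss(
    pose: Sequence[float], shape: Sequence[float], shape_weight: float
) -> float:
    """Squared magnitude of the pose vector plus a weighted squared magnitude of the shape vector."""
    return sum(p * p for p in pose) + shape_weight * sum(b * b for b in shape)


def total_loss(
    loss_2d: float,
    loss_3d: float,
    loss_mask: float,
    loss_reg: float,
    weight_2d: float,
    weight_3d: float,
    weight_mask: float,
) -> float:
    """Weighted sum of the data terms plus the (already weighted) regularizer."""
    return weight_2d * loss_2d + weight_3d * loss_3d + weight_mask * loss_mask + loss_reg


# Hypothetical values, chosen only to make the arithmetic easy to follow.
reg = regularization_loss(pose=[1.0, 2.0], shape=[3.0], shape_weight=0.5)
objective = total_loss(1.0, 2.0, 3.0, reg, weight_2d=10.0, weight_3d=100.0, weight_mask=1000.0)
```

For an image without 3D annotations, the 3D term can simply be passed as zero, so the same objective covers both 2D-only and 3D-annotated training samples.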
8 Evaluation
We evaluate our method's 3D pose estimates quantitatively and its 3D reconstructions qualitatively on several datasets and with respect to state-of-the-art methods. Without access to camera intrinsics, and trained merely with 2D and 3D joint annotations, our method outperforms competing deep-learning-based methods, including those using additional depth information in training or camera intrinsics in evaluation. We show particularly superior 3D reconstructions on images in the wild that present challenging situations such as blur, low resolution, occlusion, and extremely varying viewpoints and hand pose configurations.
Similar to [44], input images are assumed to be crops of fixed size around the hand. To achieve this, we use a hand keypoint detector [44] to find the tightest rectangular box containing the hand. Images are then cropped with a square patch centred at the same 2D location as the detected box. The resulting cropped images are subsequently resized to a fixed width and height. As done in [44], we use the right hand model, and images of left hands are flipped horizontally.
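The cropping geometry can be sketched as follows. The `scale` factor relating the square patch to the tight box, the use of the longer box edge, and the function names are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def square_crop_box(keypoints: Sequence[Point], scale: float) -> Tuple[float, float, float, float]:
    """Return (left, top, right, bottom) of a square patch around the keypoints.

    The patch is centred on the tight bounding box of the keypoints and its
    side is `scale` times the longer edge of that box (assumed convention).
    """
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    edge = max(max(xs) - min(xs), max(ys) - min(ys))
    centre_x = (min(xs) + max(xs)) / 2.0
    centre_y = (min(ys) + max(ys)) / 2.0
    half = scale * edge / 2.0
    return (centre_x - half, centre_y - half, centre_x + half, centre_y + half)


def flip_keypoints_horizontally(keypoints: Sequence[Point], image_width: int) -> List[Point]:
    """Mirror keypoints so that a left hand can be handled by a right-hand model."""
    return [(image_width - 1 - x, y) for x, y in keypoints]


box = square_crop_box([(10.0, 20.0), (30.0, 60.0)], scale=2.0)
flipped = flip_keypoints_horizontally([(10.0, 20.0)], image_width=100)
```

The crop box may extend beyond the image borders, as in this toy example, so an actual implementation would pad the image before cropping and mirror the pixels together with the keypoints.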
Finally, we train our pipeline (Figure 1) using the Adam solver [21] with weight decay.
Datasets
Our training set combines the Panoptic dataset [44], the training set of Mpii+Nzsl [44] following the split in [44], and the training set of Stereo [65] following the split in [68]. Only the Stereo images come with 3D joint annotations; the remaining images (Panoptic and Mpii+Nzsl) have 2D joint annotations only.
The Panoptic dataset [44] contains hands in various poses observed from multiple views in the Panoptic studio [16]. The Mpii+Nzsl dataset [44] is a combination of manually annotated images from The MPII Human Pose dataset [3] containing images from YouTube videos, and images from the New Zealand Sign Language (NZSL) Exercises of the Victoria University of Wellington [38]. The Stereo dataset [65] shows an actor’s hand in third person view counting with the fingers and moving the hand randomly.
For evaluation, we use the Dexter+Object dataset [49], which shows interactions of an actor's hand with a cuboid object from a third-person view. To evaluate robustness to occlusions and clutter, we use the EgoDexter dataset [28], which displays a hand from an egocentric view interacting with various objects. We finally use the testing set of Mpii+Nzsl [44] to assess performance in the presence of blur, low resolution, varying viewpoints and hand pose configurations, among other characteristics of datasets of images in the wild.
Metrics
To quantitatively evaluate 3D hand pose estimations, we report the percentage of correct keypoints in 3D (3D PCK) and the average 3D Euclidean distance between the estimated 3D joints and the ground truth when the latter is available, with distances expressed in millimeters (mm). When only ground-truth 2D joint annotations are available (the Mpii+Nzsl dataset), we report 2D PCK and the average 2D Euclidean distance between the estimated 2D reprojected joints and the ground truth, with distances expressed in pixels (px).
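Both metrics are simple to compute once predictions and ground truth are expressed in the same frame. The sketch below is a generic version that works for 2D or 3D joints; the function names are hypothetical, and counting a joint as correct when its error is less than or equal to the threshold is an assumed convention.

```python
import math
from typing import Sequence

Joints = Sequence[Sequence[float]]


def mean_joint_error(predicted: Joints, target: Joints) -> float:
    """Average Euclidean distance between corresponding joints."""
    distances = [math.dist(p, g) for p, g in zip(predicted, target)]
    return sum(distances) / len(distances)


def pck(predicted: Joints, target: Joints, threshold: float) -> float:
    """Fraction of joints whose Euclidean error does not exceed the threshold."""
    distances = [math.dist(p, g) for p, g in zip(predicted, target)]
    return sum(1 for d in distances if d <= threshold) / len(distances)


# Toy example: one perfect joint and one joint that is off by 5 units.
pred = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]
gt = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
error = mean_joint_error(pred, gt)
pck_at_5 = pck(pred, gt, threshold=5.0)
pck_at_4 = pck(pred, gt, threshold=4.0)
```

Sweeping the threshold over a range and plotting the resulting fractions gives a PCK curve, which is the form in which such results are usually reported.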
Comparison to competing methods
Table 1: Average 3D joint distance (mm) on the Stereo dataset.
Method            Ours RGB  Ours RGB+2D  Ours 2D  2D fit
3D distance (mm)  9.76      10.18        10.46    23.21
We compare our results on the Stereo dataset to state-of-the-art methods in terms of 3D PCK in Figures 4 and 5, and we show 3D joint errors in Table 1. Figure 4 shows deep-learning-based methods (Cai et al. [7], Iqbal et al. [14], Spurr et al. [47], Mueller et al. [27], Zimm. et al. [68]) and Figure 5 shows methods that do not rely on deep learning (Panteleris et al. [34], PSO, ICPPSO, CHPR [65]). For this experiment, we add a keypoint at the center of the hand palm in the MANO model [40], as an interpolation of several mesh vertices, to match the annotation of the Stereo dataset. We reproduce the evaluation protocol initially introduced in [68] by training on 10 sequences, testing on the remaining 2, and aligning predictions to the ground-truth hand root joint. Additionally, for a fair comparison to [7, 47, 14], we crop the hand images for this experiment so that the final image size is set relative to the size of the hand. Using RGB image input only, we obtain state-of-the-art results even though some of the competing methods use depth data in training ([7, 14]) in addition to images, while others ([27]) post-process their output with an optimization that fits their hand skeleton to their 3D and 2D joint predictions and that uses the camera intrinsics as an additional input.
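Table 2: Average 3D joint distance (mm) on the Dexter+Object dataset.
Method            Ours RGB  Ours RGB+2D  Ours 2D  2D fit  Spurr et al.  Zimm. et al.
3D distance (mm)  33.16     25.53        25.93    41.18   40.20         34.75
Table 3: Average 3D joint distance (mm) on the EgoDexter dataset.
Method            Ours RGB  Ours RGB+2D  Ours 2D  2D fit  Spurr et al.  Zimm. et al.
3D distance (mm)  51.87     45.58        45.33    56.59   56.92         52.77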
Figure 6 shows the performance of our method under occlusions and clutter with 3D PCK on the Dexter+Object dataset, and Table 2 shows 3D joint errors. Additionally, Figure 7 shows our results on a hand in egocentric view and in interaction with various objects in terms of 3D PCK on the EgoDexter dataset, and Table 3 shows 3D joint errors. Our method outperforms the competition in these settings as illustrated in the Figures. We note that we show relative 3D pose estimates for all methods except [14] where the authors report absolute values.
Table 4: Average 2D joint distance (px) on the testing set of Mpii+Nzsl.
Method            Ours RGB  Ours RGB+2D  Ours 2D  2D fit  Zimm. et al.
2D distance (px)  23.04     18.95        20.65    22.36   59.40
We expect our method to perform particularly well on datasets of images in the wild, as our training set contains this type of data and accounts for hands that are low-resolution, blurry, occluded, and in challenging views and pose configurations. Indeed, we compare our results to [68] on the testing set of the Mpii+Nzsl dataset in Figure 8 and Table 4 through 2D PCK and 2D joint error respectively. We outperform [68] by a substantial margin, as the figure shows. The superiority of our method on this dataset is visually confirmed in Figure 9.
Comparison to 2D fitting
In the case where 2D joint detections are used as input, an alternative way of solving 3D hand pose estimation is to perform a 2D fitting between the reprojected hand model joints and the keypoints detected on the image, in a similar fashion to the work proposed by [34]. Our implementation of this strategy consists of minimizing the following objective function with respect to the weak perspective camera parameters $R$, $t$, $s$ and the hand shape and pose parameters $\beta$, $\theta$:
$$\begin{aligned} E(\beta, \theta, R, t, s) = {} & \sum_i c_i \, \lVert s\,\Pi(R\,J(\beta, \theta)_i) + t - x_i \rVert_2^2 \\ & + \lVert \theta \rVert^2 + \lambda_{\beta} \lVert \beta \rVert^2 \end{aligned} \tag{10}$$
where $x_i$ is the $i$-th detected 2D keypoint and $c_i$ is the hand joint estimate confidence provided by the detector CNN [44]. Similarly to the loss in Equation 9, the regularization in the second line of Equation 10 is important to ensure plausible 3D hand reconstructions. We perform this optimization using Powell's Dogleg method [29] within the Chumpy [1] framework.
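Evaluating a fitting objective of this kind can be sketched as below. The squared-distance data term, the form of the prior, and every name and weight in the code are assumptions for illustration. The sketch only evaluates the objective for already reprojected joints and does not implement the dogleg optimization itself.

```python
from typing import Sequence


def fitting_objective(
    projected: Sequence[Sequence[float]],
    detected: Sequence[Sequence[float]],
    confidences: Sequence[float],
    pose: Sequence[float],
    shape: Sequence[float],
    pose_weight: float,
    shape_weight: float,
) -> float:
    """Confidence-weighted squared reprojection error plus a magnitude prior.

    projected:   reprojected model joints (x, y) for the current parameters
    detected:    2D keypoints returned by a detector
    confidences: one detector confidence per keypoint
    """
    data_term = sum(
        c * ((px - dx) ** 2 + (py - dy) ** 2)
        for (px, py), (dx, dy), c in zip(projected, detected, confidences)
    )
    prior_term = pose_weight * sum(p * p for p in pose) + shape_weight * sum(
        b * b for b in shape
    )
    return data_term + prior_term


# Hypothetical values: the first joint is 5 pixels off but only half trusted.
value = fitting_objective(
    projected=[(0.0, 0.0), (1.0, 1.0)],
    detected=[(3.0, 4.0), (1.0, 1.0)],
    confidences=[0.5, 1.0],
    pose=[1.0],
    shape=[2.0],
    pose_weight=1.0,
    shape_weight=0.5,
)
```

Weighting each residual by the detector confidence lets unreliable keypoints, for example occluded ones, contribute less to the fit. Because the data term only sees 2D positions, the prior is what keeps the recovered 3D configuration plausible.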
We compare this method (2D fit) to our proposed approach on datasets Stereo, Dexter+Object and EgoDexter with 3D PCK in Figures 5, 6 and 7 and 3D joint error in Tables 1, 2 and 3 respectively, and also on dataset Mpii+Nzsl with 2D PCK in Figure 8 and 2D joint error in Table 4. Results show that our approach outperforms the 2D fitting based strategy for all datasets. We observe that while the optimization catches up slightly with our method in 2D (Mpii+Nzsl), its performance drops considerably in 3D. Our method benefits clearly from solving the fitting problem in a learning framework and leverages visual cues in predicting the 3D hand position and configuration, while the 2D fitting relies merely on the 2D joint detection information. We also outperform the 2D fitting based method in [34] that uses a similar hand model to [32] and a perspective projection model on dataset Stereo in Figure 5.
Ablation study
We evaluate the difference between using images only (Ours RGB), using only 2D joint heatmaps obtained from a state-of-the-art hand detector [44] (Ours 2D), and using both together as input (Ours RGB+2D). We carry out comparisons on datasets Stereo, Dexter+Object and EgoDexter with 3D PCK in Figures 5, 6 and 7 and 3D joint error in Tables 1, 2 and 3 respectively, and also on dataset Mpii+Nzsl with 2D PCK in Figure 8 and 2D joint error in Table 4. On dataset Stereo, training on images alone yields the best performance, while training with a combination of images and 2D joint heatmaps is generally the most suitable approach for the other datasets that we tested on.
Qualitative
Figure 9 shows our 3D hand reconstructions on the challenging testing set of Mpii+Nzsl. As shown in this figure, the input data (9(a)) displays images of hands that are sometimes blurry, low-resolution, occluded, seen from varying viewpoints and in varying pose configurations. We show our 3D mesh overlaid on the input image (9(b)) and in alternative views (9(c), 9(d)). We also compare our hand skeleton (9(e)) to the 2D and 3D pose predictions of [68] (9(f), 9(g)) and the 3D predictions of [47] (9(h)). Our method obtains visually plausible results, while the methods in [68] and [47] fail to predict good 3D pose estimates in many cases in the Mpii+Nzsl dataset. We show more examples in the supplementary material.
9 Conclusion
We presented a method to predict 3D hand pose and shape from a single RGB image. We combine a deep convolutional encoder with a generative hand model as decoder and train the resulting network end-to-end with 2D and 3D hand joint annotated images. The encoder predicts hand parameters that are fed to the hand model, and view parameters that are used to reproject the generated 3D hand into the image domain. We obtain state-of-the-art results on 3D pose benchmarks and show compelling 3D reconstructions on a challenging set of images in the wild. This method could benefit greatly from a hand appearance model by leveraging a photometric loss in training, as proposed in [56, 55] for faces. One possible extension to this work could be to allow some components of the MANO [40] model, such as the shape and pose corrective blend shapes (Equation 2), to be fine-tuned in training for improved performance.
References
 [1] http://chumpy.org.
 [2] V. F. Abrevaya, S. Wuhrer, and E. Boyer. Multilinear autoencoder for 3d face model learning. In WACV, 2018.
 [3] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
 [4] V. Athitsos and S. Sclaroff. Estimating 3d hand pose from a cluttered image. In CVPR, 2003.
 [5] S. Bambach, S. Lee, D. J. Crandall, and C. Yu. Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. In ICCV, 2015.
 [6] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In Conference on Computer graphics and interactive techniques, 1999.
 [7] Y. Cai, L. Ge, J. Cai, and J. Yuan. Weakly-supervised 3d hand pose estimation from monocular rgb images. In ECCV, 2018.
 [8] T. E. de Campos and D. W. Murray. Regression-based hand pose estimation from multiple cameras. In CVPR, 2006.
 [9] M. de La Gorce, D. J. Fleet, and N. Paragios. Model-based 3d hand pose estimation from monocular video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011.
 [10] L. Ge, Y. Cai, J. Weng, and J. Yuan. Hand pointnet: 3d hand pose estimation using point sets. In CVPR, 2018.
 [11] L. Ge, H. Liang, J. Yuan, and D. Thalmann. Robust 3d hand pose estimation in single depth images: from single-view cnn to multi-view cnns. In CVPR, 2016.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
 [13] W. Hürst and C. Van Wezel. Gesture-based interaction via finger tracking for mobile augmented reality. Multimedia Tools and Applications, 2013.
 [14] U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz. Hand pose estimation via latent 2.5d heatmap regression. In ECCV, 2018.
 [15] Y. Jang, S.-T. Noh, H. J. Chang, T.-K. Kim, and W. Woo. 3d finger cape: Clicking action and position estimation under self-occlusions in egocentric viewpoint. IEEE Transactions on Visualization and Computer Graphics, 2015.
 [16] H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh. Panoptic studio: A massively multiview system for social motion capture. In ICCV, 2015.
 [17] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In CVPR, 2018.
 [18] L. Kavan and J. Žára. Spherical blend skinning: A real-time deformation of articulated models. In Symposium on Interactive 3D Graphics and Games, 2005.
 [19] C. Keskin, F. Kıraç, Y. E. Kara, and L. Akarun. Hand pose estimation and hand shape classification using multi-layered randomized decision forests. In ECCV, 2012.
 [20] S. Khamis, J. Taylor, J. Shotton, C. Keskin, S. Izadi, and A. Fitzgibbon. Learning an efficient model of hand shape variation from depth images. In CVPR, 2015.
 [21] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [22] S. Laine, T. Karras, T. Aila, A. Herva, S. Saito, R. Yu, H. Li, and J. Lehtinen. Production-level facial performance capture using deep convolutional neural networks. In Symposium on Computer Animation, 2017.
 [23] C. Li and K. M. Kitani. Pixel-level hand detection in egocentric videos. In CVPR, 2013.
 [24] P. Li, H. Ling, X. Li, and C. Liao. 3d hand pose estimation using randomized decision forest with segmentation index points. In ICCV, 2015.
 [25] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG) (Proc. SIGGRAPH Asia), 2015.
 [26] A. Makris and A. Argyros. Model-based 3d hand tracking with online hand shape adaptation. 2015.
 [27] F. Mueller, F. Bernard, O. Sotnychenko, D. Mehta, S. Sridhar, D. Casas, and C. Theobalt. Ganerated hands for real-time 3d hand tracking from monocular rgb. In CVPR, 2018.
 [28] F. Mueller, D. Mehta, O. Sotnychenko, S. Sridhar, D. Casas, and C. Theobalt. Real-time hand tracking under occlusion from an egocentric rgb-d sensor. In ICCV, 2017.
 [29] J. Nocedal and S. J. Wright. Nonlinear Equations. Springer, 2006.
 [30] M. Oberweger, P. Wohlhart, and V. Lepetit. Training a feedback loop for hand pose estimation. In ICCV, 2015.
 [31] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Markerless and efficient 26-dof hand pose recovery. In ACCV, 2010.
 [32] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Efficient model-based 3d tracking of hand articulations using kinect. In BMVC, 2011.
 [33] P. Panteleris and A. Argyros. Back to rgb: 3d tracking of hands and hand-object interactions based on short-baseline stereo. Hands17 Workshop ICCV, 2017.
 [34] P. Panteleris, I. Oikonomidis, and A. Argyros. Using a single rgb frame for real time 3d hand pose estimation in the wild. In WACV, 2018.
 [35] T. Piumsomboon, A. Clark, M. Billinghurst, and A. Cockburn. User-defined gestures for augmented reality. In IFIP Conference on Human-Computer Interaction, 2013.
 [36] G. Poier, D. Schinagl, and H. Bischof. Learning pose specific representations by predicting different views. In CVPR, 2018.
 [37] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun. Realtime and robust hand tracking from depth. In CVPR, 2014.
 [38] D. A. R. McKee, D. McKee and E. Pailla. Nz sign language exercises. Deaf Studies Department of Victoria University of Wellington.
 [39] J. Romero, H. Kjellström, and D. Kragic. Hands in action: real-time 3d reconstruction of hands in interaction with objects. In ICRA, 2010.
 [40] J. Romero, D. Tzionas, and M. J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (TOG), 2017.
 [41] R. Rosales, V. Athitsos, L. Sigal, and S. Sclaroff. 3d hand pose reconstruction using specialized mappings. In ICCV, 2001.
 [42] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG), 2004.
 [43] T. Sharp, C. Keskin, D. Robertson, J. Taylor, J. Shotton, D. Kim, C. Rhemann, I. Leichter, A. Vinnikov, Y. Wei, et al. Accurate, robust, and flexible real-time hand tracking. In ACM CHI, 2015.
 [44] T. Simon, H. Joo, I. Matthews, and Y. Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In CVPR, 2017.
 [45] A. Sinha, C. Choi, and K. Ramani. Deephand: Robust hand pose estimation by completing a matrix imputed with deep features. In CVPR, 2016.
 [46] J. Song, G. Sörös, F. Pece, S. R. Fanello, S. Izadi, C. Keskin, and O. Hilliges. In-air gestures around unmodified mobile devices. In ACM Symposium on User Interface Software and Technology, 2014.
 [47] A. Spurr, J. Song, S. Park, and O. Hilliges. Cross-modal deep variational hand pose estimation. In CVPR, 2018.
 [48] S. Sridhar, F. Mueller, A. Oulasvirta, and C. Theobalt. Fast and robust hand tracking using detection-guided optimization. In CVPR, 2015.
 [49] S. Sridhar, F. Mueller, M. Zollhoefer, D. Casas, A. Oulasvirta, and C. Theobalt. Real-time joint tracking of a hand manipulating an object from rgb-d input. In ECCV, 2016.
 [50] S. Sridhar, H. Rhodin, H.-P. Seidel, A. Oulasvirta, and C. Theobalt. Real-time hand tracking using a sum of anisotropic gaussians model. In 3DV, 2014.
 [51] B. Stenger, P. R. Mendonça, and R. Cipolla. Model-based 3d tracking of an articulated hand. In CVPR, 2001.
 [52] X. Sun, Y. Wei, S. Liang, X. Tang, and J. Sun. Cascaded hand pose regression. In CVPR, 2015.
 [53] D. Tang, J. Taylor, P. Kohli, C. Keskin, T.-K. Kim, and J. Shotton. Opening the black box: Hierarchical sampling optimization for estimating human hand pose. In ICCV, 2015.
 [54] J. Taylor, R. Stebbing, V. Ramakrishna, C. Keskin, J. Shotton, S. Izadi, A. Hertzmann, and A. Fitzgibbon. User-specific hand modeling from monocular depth sequences. In CVPR, 2014.
 [55] A. Tewari, M. Zollhöfer, P. Garrido, F. Bernard, H. Kim, P. Pérez, and C. Theobalt. Self-supervised multi-level face model learning for monocular reconstruction at over 250 hz. In CVPR, 2018.
 [56] A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Pérez, and C. Theobalt. Mofa: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In ICCV, 2017.
 [57] A. Thayananthan, B. Stenger, P. H. Torr, and R. Cipolla. Shape context and chamfer matching in cluttered scenes. In CVPR, 2003.
 [58] A. Tkach, M. Pauly, and A. Tagliasacchi. Sphere-meshes for real-time hand modeling and tracking. ACM Transactions on Graphics (TOG), 2016.
 [59] J. Tompson, M. Stein, Y. Lecun, and K. Perlin. Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics (TOG), 2014.
 [60] H.-Y. Tung, H.-W. Tung, E. Yumer, and K. Fragkiadaki. Self-supervised learning of motion capture. In NIPS, 2017.
 [61] C. Wan, T. Probst, L. Van Gool, and A. Yao. Dense 3d regression for hand pose estimation. In CVPR, 2018.
 [62] X. Wu, D. Finnegan, E. O’Neill, and Y.-L. Yang. Handmap: Robust hand pose estimation via intermediate dense guidance map supervision. In ECCV, 2018.
 [63] Y. Wu, J. Y. Lin, and T. S. Huang. Capturing natural hand articulation. In ICCV, 2001.
 [64] C. Xu and L. Cheng. Efficient hand pose estimation from a single depth image. In ICCV, 2013.
 [65] J. Zhang, J. Jiao, M. Chen, L. Qu, X. Xu, and Q. Yang. 3d hand pose tracking and estimation using stereo matching. arXiv preprint arXiv:1610.07214, 2016.
 [66] Y. Zhou, J. Lu, K. Du, X. Lin, Y. Sun, and X. Ma. Hbe: Hand branch ensemble network for real-time 3d hand pose estimation. In ECCV, 2018.
 [67] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
 [68] C. Zimmermann and T. Brox. Learning to estimate 3d hand pose from single rgb images. In ICCV, 2017.