GraphPoseGAN: 3D Hand Pose Estimation from a Monocular RGB Image via Adversarial Learning on Graphs

12/04/2019, by Yiming He, et al.

This paper addresses the problem of 3D hand pose estimation from a monocular RGB image. We are the first to propose a graph-based generative adversarial learning framework regularized by a hand model, aiming at realistic 3D hand pose estimation. Our model consists of a 3D hand pose generator and a multi-source discriminator. Taking a monocular RGB image as the input, the generator is essentially a residual graph convolution module with a parametric deformable hand model as a prior for pose refinement. Further, we design a multi-source discriminator that takes hand poses, bones and the input image as inputs to capture intrinsic features, which distinguishes the predicted 3D hand pose from the ground truth and leads to anthropomorphically valid hand poses. In addition, we propose two novel bone-constrained loss functions, which explicitly characterize the morphable structure of hand poses. Extensive experiments demonstrate that our model sets new state-of-the-art performance in 3D hand pose estimation from a monocular image on standard benchmarks.


I Introduction

3D human hand pose estimation is a long-standing problem in computer vision, which is critical for applications such as virtual reality and augmented reality [17, 30]. Previous works attempt to estimate hand pose from depth images [13, 42, 52, 12] or in multi-view setups [29, 49]. However, due to the diversity and complexity of hand shape, pose, gesture, occlusion, etc., it still remains a challenging problem despite years of study [31, 46, 47, 16].

Given that RGB cameras are more widely accessible than depth sensors, recent works focus mostly on 3D hand pose estimation from a monocular RGB image [14, 4, 3, 5, 53]. Moreover, [14, 4] resort to extra datasets for network pretraining, which leads to state-of-the-art performance. Nevertheless, there still remain many challenges. Firstly, compared to depth input, this task suffers from increased depth and scale ambiguities. Secondly, unlike bodies and faces that have obvious local characteristics such as eyes on a face, hands exhibit an almost uniform appearance. Further, the input RGB image usually contains external occlusion and self-occlusion due to motion. Consequently, hand poses estimated by existing methods are sometimes unrealistic.

Fig. 1: The proposed GraphPoseGAN estimates 3D hand pose from a monocular image. A hand model first generates a prior pose, which is then fed into a GCN refinement module for pose refinement with an adversarial training approach, leading to state-of-the-art hand pose estimation.

To this end, we propose a hand-model regularized graph convolutional network trained under a generative adversarial learning framework (GraphPoseGAN), which aims to estimate the real distribution of 3D hand poses. Adversarial learning has shown its effectiveness in 3D human pose estimation and generation [45, 8, 40]. Inspired by such works, we present a novel paradigm of generative adversarial networks with a hand-model prior to estimate realistic hand poses. Taking a monocular image as the input, our generator consists of a Graph Convolutional Network (GCN) refinement module regularized by a parametric hand model that serves as a good starting point. This hand model generates a template 3D hand pose, which resides on irregular grids. Hence, we naturally represent the template 3D hand pose on a graph and refine it via a GCN. In particular, motivated by recent advances in GCNs [20, 41, 51, 6, 21], we present a refinement module leveraging the recently proposed ResGCN [21], which efficiently exploits the structural relationship among nodes in the constructed graph.

Further, we design a multi-source discriminator, which employs hand poses, hand bones computed from poses, as well as the input image to distinguish the predicted 3D hand pose from the ground truth, leading to anthropomorphically valid hand poses. The input poses and bones go through a GCN and a fully-connected layer respectively to learn structural features of the 3D hand pose, while the input image goes through a CNN to extract 2D hand pose features. These features are then concatenated and fed into a fully-connected layer to acquire the final score. Moreover, while most existing works [4, 14, 5] deploy the 3D Euclidean distance between joints as the loss function on 3D annotations, we propose two novel loss functions that constrain the bones connecting joints so as to preserve the hand structure explicitly, which we refer to as bone-constrained loss functions. Experimental results demonstrate that our model outperforms state-of-the-art approaches on standard benchmarks.

To summarize, our main contributions include

  • We propose a graph-convolutional generative adversarial learning paradigm for 3D hand pose estimation from a monocular image, where a parametric hand model serves as a superior prior for pose refinement.

  • To the best of our knowledge, we are the first in the literature to estimate 3D hand pose via adversarial training, which learns the real distribution of 3D hand poses.

  • We introduce two novel bone-constrained loss functions, which take the structural relationship between joints into consideration explicitly.

II Related Work

According to the input modalities, 3D hand pose estimation methods can be classified into three categories: 1) 3D hand pose estimation from depth images; 2) 3D hand pose estimation from multiple RGB images; 3) 3D hand pose estimation from a monocular RGB image.

3D hand pose estimation from depth images. Depth images contain rich 3D information [37] for hand pose estimation, which has shown promising accuracy [48]. There is a rich literature on 3D hand pose estimation with depth images as input [12, 13, 11, 9, 10, 19, 43, 23, 26]. Among them, some works [10, 19] are based on a deformable hand model with an iterative optimization training approach. A recent work [23] leverages a CNN to learn the shape and pose parameters of a proposed LBS hand model.

3D hand pose estimation from multiple images. Multiple RGB images taken from different views also contain rich 3D information. Therefore, some works take multiple images as input [7, 28, 35] to alleviate the occlusion problem. Campos et al. [7] propose a regression-based approach for hand pose estimation, where they utilize multi-view images to overcome the occlusion issue. Sridhar et al. [35] contribute a fundamentally extended generative tracking algorithm based on an augmented implicit shape representation with multiple images as input.

3D hand pose estimation from a monocular RGB image. Compared with the aforementioned two categories, a monocular RGB image is more accessible. Early works [2, 32, 36] propose complex model-fitting approaches, which typically rely on dynamics, multiple hypotheses, and restrictive assumptions. These model-fitting approaches have proposed many hand models, based on assembled geometric primitives [27] or sphere meshes [38], etc. Nevertheless, these sophisticated approaches suffer from low estimation accuracy. Our work deploys the MANO hand model [33] as our prior, which models both hand shape and pose as well as generates meshes. With the advance of deep learning, many recent works estimate 3D hand pose from a monocular RGB image using neural networks [14, 4, 3, 5, 53, 44]. Among them, [5] proposes an end-to-end learning-based 3D hand pose estimation model for weakly-supervised adaptation from fully-annotated synthetic images to weakly-labeled real-world images. Ge et al. [14] propose to estimate the vertices of 3D meshes with GCNs [20] in order to learn nonlinear hand shape variations.

III Overview of the Proposed Approach

Fig. 2: Architecture of the proposed graph-based generative adversarial networks for 3D hand pose estimation.

We propose a regularized Generative Adversarial Network (GAN) paradigm, which consists of a generator G and a discriminator D, as illustrated in Fig. 2.

Given a monocular RGB image I as the input, the generator G includes two modules:

  1. The first module—hand model module—generates a template 3D hand pose P_t as a prior for the subsequent module, and consists of an encoder and a parametric hand model. The encoder extracts a latent code z from the input image I as the parameters of the hand model. Subsequently, the hand model generates the template 3D hand pose P_t from the latent code z.

  2. The second module—GCN refinement module—aims at pose refinement. Taking P_t and I as the input, the GCN refinement module outputs the deformation ΔP of the 3D hand pose, leading to the refined pose P = P_t + ΔP. Alternatively, since the hand model module also generates a template 3D hand mesh M_t, we can choose M_t as the input to the GCN refinement module and estimate the refined hand pose from the refined mesh (a minimal forward-pass sketch is given after this list).
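The pipeline can be summarized by the sketch below; the module names (hand_model_module, feature_cnn, gcn_refinement), the tensor shapes, and the 778-vertex mesh size are assumptions for illustration, not the authors' code.

```python
def generator_forward(image, hand_model_module, feature_cnn, gcn_refinement,
                      use_mesh_prior=False):
    """Sketch of the two-module generator G (hypothetical interface)."""
    # Module 1: encoder + parametric hand model produce a template pose P_t
    # (and a template mesh M_t) from the monocular image I.
    template_pose, template_mesh = hand_model_module(image)   # (B, 21, 3), (B, 778, 3)

    # Module 2: the prior and a 2-D image feature F are fed to the GCN, which
    # predicts a residual deformation added to the prior (Eq. 3 / Eq. 8).
    prior = template_mesh if use_mesh_prior else template_pose
    img_feat = feature_cnn(image)                              # 2-D image feature F
    refined = gcn_refinement(prior, img_feat)                  # prior + deformation
    return refined
```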

The multi-source discriminator D distinguishes the ground-truth 3D poses from the predicted ones. In this module, we employ three input sources: 1) the 3D hand pose P; 2) the original image I; and 3) the hand bone matrix, which is computed from the 3D hand pose.

Having designed G and D, we adopt the framework of SNGAN [24] for the adversarial training of the entire network with good convergence. Our training process is divided into three stages. Firstly, we only train the hand model module on the dataset and save the model parameters. Thereafter, we load the parameters and train the entire generator on the dataset. Finally, we go through the adversarial training stage.

IV The Proposed Generator

IV-A The Hand Model Module

Given an input monocular image, this module aims to generate a template 3D hand pose P_t, which serves as a prior for the subsequent hand pose refinement. The hand model module consists of an encoder and a parametric hand model, which we discuss in detail below.

IV-A1 Encoder

Fig. 3: Architecture of the encoder in Generator Module I. Taking a monocular RGB image as the input, the encoder adopts a ResNet-50 [15] network to learn the latent code z.

The goal of the encoder is to predict the parameters of the hand model. Specifically, as illustrated in Fig. 3, we employ ResNet-50 [15] as the encoder to efficiently estimate the model parameters. The input is a monocular RGB image, which is cropped to the salient hand region and resized. The cropped image is then fed into the ResNet-50 network, which extracts features for the construction of the latent code z, i.e., the parameters of the hand model.
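A minimal sketch of such an encoder follows. The layout of the latent code (45-d pose, 10-d shape, 3-d rotation, 3-d translation, 1-d scale) is partly an assumption: the 45-d pose vector and the three camera parameters are stated in this paper, while the 10-d shape vector follows the original MANO.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class HandEncoder(nn.Module):
    """ResNet-50 encoder regressing the hand-model latent code z (a sketch)."""
    def __init__(self, pose_dim=45, shape_dim=10):
        super().__init__()
        backbone = models.resnet50()
        backbone.fc = nn.Identity()              # keep the 2048-d pooled feature
        self.backbone = backbone
        # latent code layout: pose theta | shape beta | rotation R | translation T | scale s
        self.dims = [pose_dim, shape_dim, 3, 3, 1]
        self.head = nn.Linear(2048, sum(self.dims))

    def forward(self, image):
        feat = self.backbone(image)              # (B, 2048)
        z = self.head(feat)                      # (B, 62)
        theta, beta, rot, trans, scale = torch.split(z, self.dims, dim=1)
        return theta, beta, rot, trans, scale
```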

IV-A2 Parametric Hand Model

A hand model is able to represent both hand shape and pose with a few parameters, which makes it a suitable prior for hand pose estimation. In particular, we employ a modified MANO hand model [33], which is based on the SMPL model [22] for human bodies. The MANO hand model is a deformable hand mesh model that takes two vectors θ and β, contained in the latent code z, as the input; they control the pose and shape of the generated hand respectively. The output of the original MANO includes a hand mesh and a hand pose.

In the original MANO, θ is a 6-dimensional vector that represents a PCA subspace for computational efficiency. However, the limited dimensionality of θ is unable to represent some poses. Hence, we skip the PCA projection and employ the full 45-dimensional vector θ.

Besides, we need three additional parameters to position the mesh in a camera coordinate system, so that we can obtain the 3D coordinates of each point in the hand mesh and pose: 1) a 3D rotation parameter R; 2) a 3D translation parameter T; and 3) a scale parameter s. These three parameters define the camera coordinate system. Hence, we formulate the complete hand model as:

$M_t = s \cdot \mathcal{R}\left(\mathrm{MANO}(\theta, \beta)\right) + T$ (1)

where $\mathcal{R}(\cdot)$ is a rotation function determined by the rotation parameter R, and MANO(θ, β) denotes the template mesh output by MANO. The original MANO model regresses the 3D hand pose from the 3D hand mesh based on a set of parameters J. Thus, the pose regressor in our complete hand model is defined as

$P_t = J \cdot M_t$ (2)

where J is the set of regression parameters derived from the MANO hand model.
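The following is a rough sketch of Eqs. (1)–(2) under two explicit assumptions: the rotation parameter R is an axis-angle vector (applied via Rodrigues' formula), and the regressor J is a (21, 778) matrix mapping mesh vertices to joints; the paper specifies neither.

```python
import torch

def axis_angle_to_matrix(r):
    """Rodrigues' formula: axis-angle vectors (B, 3) -> rotation matrices (B, 3, 3)."""
    angle = r.norm(dim=1, keepdim=True).clamp(min=1e-8)
    axis = r / angle
    x, y, z = axis.unbind(dim=1)
    zero = torch.zeros_like(x)
    K = torch.stack([zero, -z, y, z, zero, -x, -y, x, zero], dim=1).view(-1, 3, 3)
    eye = torch.eye(3, device=r.device).expand(r.shape[0], 3, 3)
    a = angle.view(-1, 1, 1)
    return eye + torch.sin(a) * K + (1.0 - torch.cos(a)) * (K @ K)

def complete_hand_model(mano_mesh, rot, trans, scale):
    """Eq. (1): scale, rotate and translate the MANO mesh into the camera frame."""
    R = axis_angle_to_matrix(rot)                                    # (B, 3, 3)
    return scale.view(-1, 1, 1) * (mano_mesh @ R.transpose(1, 2)) + trans[:, None, :]

def regress_pose(mesh, joint_regressor):
    """Eq. (2): joints P_t as a fixed linear combination J of mesh vertices."""
    return torch.einsum('jv,bvc->bjc', joint_regressor, mesh)       # (B, 21, 3)
```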

IV-B The GCN Refinement Module

Having acquired a template 3D hand pose P_t as a prior, the GCN refinement module aims to refine the hand pose in a supervised fashion. Since the key points of a 3D hand pose reside on irregular grids, it is natural to represent them on a graph and employ a GCN [20] for pose refinement. In particular, this module consists of two steps: 2D image feature extraction and residual graph convolution.

Fig. 4: Architecture of the GCN refinement module in our generator.

2D Image Feature Extraction Since the input RGB image I contains prominent information about the hand pose, we first extract features from I. Specifically, we employ a typical image-based CNN following the ResNet-50 architecture [15] to extract a 2D feature vector F from I.

Residual Graph Convolution We represent the irregular template hand pose on unweighted graphs, where each key point on the hand is treated as a node in the graph and nodes are connected according to the human hand structure. Then we refine the hand pose using residual graph convolution.

Specifically, we leverage the design of the Bottleneck residual block [15] and the recent ResGCN [21], which has shown that graph residual learning allows graph convolutional networks to go deeper and perform better. Thus, we stack Graph Res-blocks in our network, as shown in Fig. 4.
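A minimal sketch of such a block is given below, assuming a plain Kipf–Welling graph convolution and a bottleneck width of half the feature dimension (both assumptions; the paper does not give the exact layer configuration).

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Kipf & Welling style graph convolution: X' = A_hat @ X @ W."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (B, N, in_dim) node features, adj: normalized (N, N) hand-skeleton adjacency
        return adj @ self.linear(x)

class GraphResBlock(nn.Module):
    """Bottleneck-style residual block on graphs, in the spirit of ResGCN [21]."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.reduce = GraphConv(dim, hidden_dim)     # reduce feature dimension
        self.conv = GraphConv(hidden_dim, hidden_dim)
        self.expand = GraphConv(hidden_dim, dim)     # expand back to input dimension
        self.act = nn.ReLU(inplace=True)

    def forward(self, x, adj):
        h = self.act(self.reduce(x, adj))
        h = self.act(self.conv(h, adj))
        h = self.expand(h, adj)
        return x + h                                 # graph residual connection
```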

Instead of directly generating a new hand pose, we take the template pose P_t generated by the hand model module as a prior, and learn the deformation from P_t as follows:

$P = P_t + \Delta P, \quad \Delta P = \mathrm{GCN}(P_t \oplus F)$ (3)

where P denotes the refined pose, ⊕ denotes the concatenation operation, and F denotes the 2D image feature vector. ΔP represents the deformation between the prior P_t and the estimate P. Supervised with the ground-truth hand pose, the GCN refinement module essentially learns the pose deformation and thus refines the prior pose.
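Putting the pieces together, a sketch of the refinement module could look as follows, reusing GraphConv and GraphResBlock from the sketch above; the embedding width, the number of blocks, and broadcasting the global image feature to every node are assumptions.

```python
import torch
import torch.nn as nn

class GCNRefinement(nn.Module):
    """Sketch of the GCN refinement module implementing Eq. (3)."""
    def __init__(self, adj, img_feat_dim=512, hidden=128, num_blocks=4):
        super().__init__()
        self.register_buffer('adj', adj)                 # (N, N) hand-skeleton graph
        self.embed = GraphConv(3 + img_feat_dim, hidden)
        self.blocks = nn.ModuleList(
            [GraphResBlock(hidden, hidden // 2) for _ in range(num_blocks)])
        self.out = GraphConv(hidden, 3)

    def forward(self, prior, img_feat):
        B, N, _ = prior.shape
        feat = img_feat[:, None, :].expand(B, N, img_feat.shape[-1])
        x = torch.cat([prior, feat], dim=-1)             # P_t concatenated with F per node
        x = self.embed(x, self.adj)
        for blk in self.blocks:
            x = blk(x, self.adj)
        deformation = self.out(x, self.adj)              # delta P
        return prior + deformation                       # refined pose P
```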

V The Proposed Discriminator

Fig. 5: Architecture of the multi-source discriminator. The discriminator contains three sources: 1) the original image; 2) the refined hand pose; 3) the hand bone matrix computed from pose. The three sources are separately embedded and then concatenated to distinguish a fake hand pose from the ground truth.

In the adversarial training stage, while the generator learns to produce predicted hand poses that are indistinguishable to the discriminator, the discriminator attempts to distinguish real samples from fake ones, i.e., the predicted hand poses.

While a simple discriminator architecture would be a fully-connected network taking only the hand pose as input, it has two shortcomings: 1) the relationship between the RGB image and the refined hand pose is neglected; 2) structural properties of the hand pose are not taken into account explicitly.

To this end, inspired by the multi-source architecture in [45], we design a multi-source discriminator with three data modalities as input to address the aforementioned issues. As illustrated in Fig. 5, the input modalities include: 1) features of the input monocular image; 2) features of the refined hand pose; 3) features of bones computed from the refined hand pose (via the KCS layer as in [39]). The bone features contain prominent structural information such as bone length and direction, thus characterizing the hand structure accurately. In particular, we employ a CNN to extract the features of the input monocular image, a GCN to learn the representation of the refined hand pose and one fully-connected layer to capture the features of bone structures computed from the refined hand pose.

Besides, the architecture of our multi-source discriminator is based on SNGAN [24] with spectral normalization layers. The loss of the discriminator is a Wasserstein loss function as in [1].
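The following is a sketch of such a multi-source discriminator, reusing the GraphConv sketch above. The branch widths, the ResNet-18 image branch, and the bone-feature layout (per-bone length plus unit direction) are assumptions; only the three-branch structure, the spectral normalization, and the single real-valued score follow the description above.

```python
import torch
import torch.nn as nn
import torchvision.models as models
from torch.nn.utils import spectral_norm

class MultiSourceDiscriminator(nn.Module):
    """Image, pose and bone features are embedded separately, concatenated and
    mapped to one score (no sigmoid, since a Wasserstein loss is used)."""
    def __init__(self, adj, num_joints=21, num_bones=20):
        super().__init__()
        cnn = models.resnet18()
        cnn.fc = nn.Identity()                            # image -> 512-d feature
        self.image_branch = cnn
        self.register_buffer('adj', adj)
        self.pose_branch = GraphConv(3, 64)               # pose -> per-joint features
        self.bone_branch = spectral_norm(nn.Linear(num_bones * 4, 128))
        self.score = spectral_norm(nn.Linear(512 + num_joints * 64 + 128, 1))

    def forward(self, image, pose, bones):
        f_img = self.image_branch(image)                          # (B, 512)
        f_pose = self.pose_branch(pose, self.adj).flatten(1)      # (B, 21*64)
        f_bone = torch.relu(self.bone_branch(bones.flatten(1)))   # (B, 128)
        return self.score(torch.cat([f_img, f_pose, f_bone], dim=1))
```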

VI The Proposed Loss Functions

Fig. 6: Illustration of the residual between the ground-truth hand pose (marked in green) and the predicted one (marked in red). Each hand pose has 21 key joints. We denote a bone vector connecting two key joints i and j by b_{i,j}, such as the one highlighted in the figure.

The loss functions in previous hand pose estimation methods measure the average distance over joints of the 3D hand pose and of the projected 2D hand pose [4, 14, 5], which we refer to as L_3D and L_2D respectively.

In addition, in order to capture the structural properties of the hand pose, we propose two novel bone-constrained loss functions that characterize the length and direction of each bone, denoted as L_len and L_dir.

In particular, L_len quantifies the discrepancy in bone length between the ground-truth hand pose and its estimate, which we define as

$L_{\mathrm{len}} = \sum_{(i,j)} \left| \, \lVert b_{i,j} \rVert_2 - \lVert \hat{b}_{i,j} \rVert_2 \right|$ (4)

where b_{i,j} and \hat{b}_{i,j} are the bone vectors connecting joints i and j of the ground-truth and the predicted 3D pose respectively. Specifically, as illustrated in Fig. 6, b_{i,j} denotes a bone vector between joint i and joint j:

$b_{i,j} = p_j - p_i$ (5)

where p_i and p_j are the coordinates of joint i and joint j respectively.

Moreover, we define L_dir to measure the loss in the direction of bones:

$L_{\mathrm{dir}} = \sum_{(i,j)} \left\lVert \frac{b_{i,j}}{\lVert b_{i,j} \rVert_2} - \frac{\hat{b}_{i,j}}{\lVert \hat{b}_{i,j} \rVert_2} \right\rVert_2$ (6)

This is motivated by the fact that a small loss in joints sometimes does not reflect a large distortion in the hand pose. Take the two joints highlighted in Fig. 6 as an example: the distance between the ground-truth joints and the predicted ones is trivial. However, the orientation of the predicted bone significantly deviates from that of the ground truth. This distortion in hand structure is well captured by our proposed loss in bone direction L_dir.

Besides, because we adopt the framework of GANs, we also introduce the Wasserstein loss L_adv into the loss function for adversarial training.

Hence, the overall loss function is

$L = L_{3D} + L_{2D} + \lambda_{\mathrm{len}} L_{\mathrm{len}} + \lambda_{\mathrm{dir}} L_{\mathrm{dir}} + \lambda_{\mathrm{adv}} L_{\mathrm{adv}}$ (7)

where λ_len, λ_dir and λ_adv are hyperparameters for the trade-off among these losses.
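A sketch of the bone-constrained losses and the overall objective is given below. The bone list, the use of a mean instead of a sum, and the placeholder weights are assumptions; Eqs. (4)–(7) above are the authoritative definitions.

```python
import torch

# (parent, child) joint pairs forming the 20 bones of a 21-joint hand;
# the exact joint ordering is an assumption.
HAND_BONES = [(0, 1), (1, 2), (2, 3), (3, 4),        # thumb
              (0, 5), (5, 6), (6, 7), (7, 8),        # index
              (0, 9), (9, 10), (10, 11), (11, 12),   # middle
              (0, 13), (13, 14), (14, 15), (15, 16), # ring
              (0, 17), (17, 18), (18, 19), (19, 20)] # little

def bone_vectors(pose):
    """Eq. (5): b_{i,j} = p_j - p_i for every bone of a (B, 21, 3) pose."""
    i = torch.tensor([b[0] for b in HAND_BONES])
    j = torch.tensor([b[1] for b in HAND_BONES])
    return pose[:, j] - pose[:, i]                    # (B, 20, 3)

def bone_losses(pred_pose, gt_pose, eps=1e-8):
    """L_len (Eq. 4) and L_dir (Eq. 6), averaged over bones and batch."""
    b_hat, b = bone_vectors(pred_pose), bone_vectors(gt_pose)
    len_hat, len_gt = b_hat.norm(dim=-1), b.norm(dim=-1)
    loss_len = (len_hat - len_gt).abs().mean()
    dir_hat = b_hat / (len_hat.unsqueeze(-1) + eps)
    dir_gt = b / (len_gt.unsqueeze(-1) + eps)
    loss_dir = (dir_hat - dir_gt).norm(dim=-1).mean()
    return loss_len, loss_dir

def total_loss(l_3d, l_2d, l_len, l_dir, l_adv,
               lam_len=1.0, lam_dir=1.0, lam_adv=1.0):
    """Eq. (7): weighted combination; the weight values here are placeholders."""
    return l_3d + l_2d + lam_len * l_len + lam_dir * l_dir + lam_adv * l_adv
```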

VII Experimental Results

(a) Our results on STB (using mesh GCN)
(b) Our results on STB (using pose GCN)
(c) Our results on RHD (using mesh GCN)
(d) Our results on RHD (using pose GCN)
Fig. 7: Self-comparisons of 3D hand pose estimation on the STB dataset and RHD dataset.

We evaluate the three stages in our training process as well as normalization methods (Batch Normalization (BN) or Group Normalization (GN)).

VII-A Datasets and Metrics

Datasets We evaluate our approach on two publicly available datasets: the Stereo Hand Pose Tracking Benchmark (STB) [50] and the Rendered Hand Pose Dataset (RHD) [54].

STB is a real-world dataset with an image resolution of 640×480. STB contains two subsets: one subset is captured by a Point Grey Bumblebee2 stereo camera (STB-BB) and the other is captured by an active depth camera (STB-SK). Following the setting in previous works [5], we only use the STB-SK subset, which contains 18,000 images with ground-truth annotations of 21 hand joint locations. Following [54], we split the 18,000 images into 15,000 training samples and 3,000 test samples. During training, we crop and resize the images to the network input resolution. Besides, to make the joint definition consistent with our setting and the RHD dataset, we move the location of the root joint from the palm center to the wrist following [14].

RHD is a synthetic dataset with an image resolution of 320×320, which is built upon 20 different characters performing 39 actions and is composed of 41,258 images for training and 2,728 images for testing. All samples are annotated with 2D and 3D key point locations. Compared to STB, RHD is more challenging due to the large variations in viewpoints. We also crop and resize the images to the network input resolution during training.

Metrics We evaluate the 3D hand pose estimation performance with two metrics: (i) pose error: the average Euclidean distance between the estimated 3D joints and the ground-truth joints; (ii) percentage of correct key points (PCK): the percentage of key points whose Euclidean error is below a threshold.
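For concreteness, the two metrics amount to the following minimal sketch, assuming poses are (B, 21, 3) tensors expressed in millimetres:

```python
import torch

def pose_error(pred, gt):
    """Mean per-joint Euclidean distance between prediction and ground truth."""
    return (pred - gt).norm(dim=-1).mean()

def pck(pred, gt, threshold):
    """Fraction of key points whose Euclidean error is below `threshold`."""
    return ((pred - gt).norm(dim=-1) < threshold).float().mean()
```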

VII-B Implementation Details

The input of our network is a monocular RGB image, which is cropped around the hand and resized to a fixed resolution. The hyperparameters in Eq. (7) are fixed throughout our implementation. In our preliminary experiments, we found that end-to-end training of the entire network from scratch does not lead to good results. The difficulty of training is due to the complex dependencies among different modules of our network and the highly non-linear property of each module. We speculate that a good initialization would ease the difficulty of training. Thus, we propose a stage-by-stage training paradigm, which consists of three stages.

Stage I. In the first stage, the hand model module is randomly initialized and is trained for 100 epochs using the Adam optimizer with a learning rate of 0.001. After training, the encoder of the hand model module predicts reasonable parameters for a given hand image, from which the MANO hand model regresses a coarse hand pose.

Stage II. In the second stage, the entire generator is trained for 100 epochs using the Adam optimizer with a learning rate of 0.0001. The hand model module is initialized with the model trained in the first stage and the GCN refinement module is randomly initialized. After training the entire generator, the predicted hand pose is much finer than that of Stage I.

Stage III. In the third stage, we fine-tune the entire generator along with the discriminator to minimize the adversarial loss. We initialize the entire generator using the model trained in the second stage and fine-tune it for 100 epochs using the Adam optimizer with a learning rate of 0.00001.
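The three-stage schedule amounts to the optimizer setup sketched below; the attribute name hand_model_module is an assumption, and the per-epoch loss plumbing is omitted.

```python
import torch

def stage_optimizers(generator, discriminator):
    """Optimizer setup for the three training stages (a sketch, not the authors' code)."""
    hand_model = generator.hand_model_module             # assumed attribute name
    return {
        # Stage I: hand model module only, Adam, lr = 1e-3, 100 epochs.
        'stage1': (torch.optim.Adam(hand_model.parameters(), lr=1e-3), 100),
        # Stage II: entire generator, Adam, lr = 1e-4, 100 epochs,
        # starting from the Stage-I hand-model weights.
        'stage2': (torch.optim.Adam(generator.parameters(), lr=1e-4), 100),
        # Stage III: adversarial fine-tuning of G and D, Adam, lr = 1e-5, 100 epochs.
        'stage3': ((torch.optim.Adam(generator.parameters(), lr=1e-5),
                    torch.optim.Adam(discriminator.parameters(), lr=1e-5)), 100),
    }
```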

                      Stage I    Stage II             Stage III
                                 mesh      pose       mesh      pose
Batch Normalization   96.5268    16.6879   16.5744    16.4026   15.3232
Group Normalization   83.3669    14.5440   15.8356    12.9770   12.4036

TABLE I: 3D Euclidean distance (mm) on the RHD dataset

                      Stage I    Stage II             Stage III
                                 mesh      pose       mesh      pose
Batch Normalization   24.1534    8.6444    5.1153     4.2410    3.9664
Group Normalization   26.4234    12.0623   11.6627    5.3062    4.2305

TABLE II: 3D Euclidean distance (mm) on the STB dataset

VII-C Ablation Studies

We study the performance of our model with different normalizations and loss functions at different stages. Also, as described in Section III, we can employ the template pose P_t or the template mesh M_t as the prior of the subsequent GCN refinement module. We thus evaluate the performance using both priors.

Note that, similar to the template pose as the prior, when taking the template mesh M_t as input to the GCN refinement module, the GCN learns the deformation ΔM from the template mesh and estimates the refined mesh M as:

$M = M_t + \Delta M, \quad \Delta M = \mathrm{GCN}(M_t \oplus F)$ (8)

The pose regressor in (2) is then employed to compute the refined pose P from the refined mesh M. Note that, because M shares the same number of nodes and a similar structure with the template mesh M_t generated from the MANO hand model, we reuse the pose regressor of the MANO hand model in the first module of the generator for pose regression.
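Under the same assumptions as the earlier sketches (a GCNRefinement module that returns the prior plus its predicted deformation, and a (21, 778) MANO-derived regressor), this variant could be sketched as:

```python
import torch

def refine_via_mesh(template_mesh, img_feat, gcn_refinement, joint_regressor):
    """Mesh-prior variant: refine M_t (Eq. 8), then regress P from the refined mesh (Eq. 2)."""
    refined_mesh = gcn_refinement(template_mesh, img_feat)              # M = M_t + delta M
    refined_pose = torch.einsum('jv,bvc->bjc', joint_regressor, refined_mesh)
    return refined_mesh, refined_pose
```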

VII-C1 Ablation studies on different stages

In Tab. I and Tab. II, we compare the results of three training stages in average 3D Euclidean distance, where (1) Batch Normalization / Group Normalization denote the type of normalizations in our model; (2) mesh / pose denote the type of the template prior. In Fig. 7, we present the 3D PCK results of our proposed method.

As presented in Tab. I, Tab. II and Fig. 7, the performance of Stage II (mesh or pose) is superior to that of Stage I, which indicates our proposed GCN refinement module is beneficial to 3D hand pose estimation and plays the most critical role in our model. The adversarial training approach (Stage III) further improves the result, by learning a real distribution of the 3D hand pose.

In addition, we see that the template pose leads to better performance than the template mesh, due to the lack of 3D hand mesh supervision. Besides, the two normalization strategies perform differently on the STB and RHD datasets: batch normalization performs better on STB, while group normalization performs better on RHD.

VII-C2 Ablation studies on loss functions

Fig. 8: Ablation studies on the proposed bone-constrained loss functions at three stages.
Fig. 9: Qualitative results for the evaluation of the proposed bone-constrained loss functions.
(a) Comparison with state-of-the-art methods on the STB dataset
(b) Comparison with state-of-the-art methods on the RHD dataset
Fig. 10: Comparisons with state-of-the-art methods on the STB and RHD datasets.

We also evaluate the proposed bone-constrained loss functions. Taking the mesh + Group Normalization architecture as an example, we train the network on the STB dataset with and without the proposed loss functions. As illustrated in Fig. 8, the network trained with our bone-constrained loss functions performs better in all three stages. This is attributable to the bone constraints, which take the structural properties of human hands into consideration. Further, we show a visual comparison of estimated poses with and without the bone losses in Fig. 9. We observe that the estimated pose may exhibit unnatural distortion in bone directions when the bone-constrained loss functions are not applied, e.g., the little finger in the first row and the thumb in the second row. In contrast, with the proposed bone constraints enforced, our results have a natural structure in the orientation of bones.

VII-D Experimental Results and Qualitative Results

Fig. 11: Qualitative results of our proposed network. The 2D pose is projected from the 3D pose.

We compare our method with competitive 3D hand pose estimation approaches on the RHD and STB datasets, as presented in Fig. 10. On the STB dataset, as shown in Fig. 10(a), we compare with the latest methods [5, 18, 4, 14, 25, 53]. Our paradigm achieves comparable performance with the state-of-the-art [14], closely reaching the upper bound of 3D PCK at all error thresholds; this is because the STB dataset is relatively simple. On the RHD dataset, as presented in Fig. 10(b), we compare with recent approaches [14, 5, 34, 53], and significantly outperform the state-of-the-art [14] on average over all error thresholds, validating the superiority of our method even on such a challenging dataset.

Fig. 12: Qualitative results of different stages.

Moreover, we present some qualitative results of our 3D hand pose estimation in Fig. 11. The generated poses are accurate and natural even in cases of severe self-occlusion, as shown in the first three columns of Fig. 11. This validates the effectiveness of the proposed pose prior and generative adversarial learning framework. We also show visual results of our method at different stages in Fig. 12. We see that Stage I estimates a coarse hand pose from the MANO hand model as a starting point, while Stage II refines the initial pose greatly. Finally, Stage III generates more realistic hand poses via adversarial learning.

VIII Conclusion

In this paper, we propose a hand-model regularized graph convolutional network under an adversarial training framework. To the best of our knowledge, we are the first in the literature to exploit a hand model to compensate for the lack of prior knowledge. Further, we present a novel three-stage adversarial training paradigm to learn the real distribution of 3D hand poses, and propose bone-constrained loss functions to enforce natural hand structures. Experiments demonstrate that our method sets new state-of-the-art performance, validating the superiority of the pose prior and the adversarial training paradigm.

References

  • [1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, Sydney, Australia, Aug 2017. PMLR.
  • [2] Vassilis Athitsos and Stan Sclaroff. Estimating 3d hand pose from a cluttered image. In IEEE Computer Society Conference on Computer Vision & Pattern Recognition, 2003.
  • [3] Seungryul Baek, Kwang In Kim, and Tae-Kyun Kim. Pushing the envelope for rgb-based dense 3d hand pose estimation via neural rendering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [4] Adnane Boukhayma, Rodrigo de Bem, and Philip H.S. Torr. 3d hand shape and pose from images in the wild. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [5] Yujun Cai, Liuhao Ge, Jianfei Cai, and Junsong Yuan. Weakly-supervised 3d hand pose estimation from monocular rgb images. In The European Conference on Computer Vision (ECCV), September 2018.
  • [6] Yujun Cai, Liuhao Ge, Jun Liu, Jianfei Cai, Tat-Jen Cham, Junsong Yuan, and Nadia Magnenat Thalmann. Exploiting spatial-temporal relationships for 3d pose estimation via graph convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [7] T. E. De Campos and D. W. Murray. Regression-based hand pose estimation from multiple cameras. In IEEE Computer Society Conference on Computer Vision & Pattern Recognition, 2006.
  • [8] Yu Chen, Chunhua Shen, Xiu-Shen Wei, Lingqiao Liu, and Jian Yang. Adversarial posenet: A structure-aware convolutional network for human pose estimation. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [9] Chiho Choi. Deephand: Robust hand pose estimation by completing a matrix imputed with deep features. In Computer Vision & Pattern Recognition, 2016.
  • [10] M. de La Gorce, D. J. Fleet, and N. Paragios. Model-based 3d hand pose estimation from monocular video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9):1793–1805, 2011.
  • [11] Andrew Fitzgibbon. Accurate, robust, and flexible real-time hand tracking. In Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI), pages 3633–3642, 2015.
  • [12] Liuhao Ge, Yujun Cai, Junwu Weng, and Junsong Yuan. Hand pointnet: 3d hand pose estimation using point sets. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8417–8426, June 2018.
  • [13] Liuhao Ge, Hui Liang, Junsong Yuan, and Daniel Thalmann. Robust 3d hand pose estimation in single depth images: From single-view cnn to multi-view cnns. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [14] Liuhao Ge, Zhou Ren, Yuncheng Li, Zehao Xue, Yingying Wang, Jianfei Cai, and Junsong Yuan. 3d hand shape and pose estimation from a single rgb image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [16] Hui Liang, Junsong Yuan, Jun Lee, Liuhao Ge, and Daniel Thalmann. Hough forest with optimized leaves for global hand pose estimation with arbitrary postures. IEEE Transactions on Cybernetics, PP(99):1–15, 2017.
  • [17] Wolfgang Hürst and Casper van Wezel. Gesture-based interaction via finger tracking for mobile augmented reality. Multimedia Tools and Applications, 62:233–258, 2011.
  • [18] Umar Iqbal, Pavlo Molchanov, Thomas Breuel Juergen Gall, and Jan Kautz. Hand pose estimation via latent 2.5d heatmap regression. In The European Conference on Computer Vision (ECCV), September 2018.
  • [19] Sameh Khamis, Jonathan Taylor, Jamie Shotton, Cem Keskin, Shahram Izadi, and Andrew Fitzgibbon. Learning an efficient model of hand shape variation from depth images. In IEEE Conference on Computer Vision & Pattern Recognition, 2015.
  • [20] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
  • [21] Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. Deepgcns: Can gcns go as deep as cnns? In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [22] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. Smpl: A skinned multi-person linear model. ACM Trans. Graph., 34(6):248:1–248:16, Oct. 2015.
  • [23] Jameel Malik, Ahmed Elhayek, Fabrizio Nunnari, Kiran Varanasi, and Didier Stricker. Deephps: End-to-end estimation of 3d hand pose and shape by learning from synthetic depth. In 2018 International Conference on 3D Vision (3DV), 2018.
  • [24] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
  • [25] Franziska Mueller, Florian Bernard, Oleksandr Sotnychenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, and Christian Theobalt. Ganerated hands for real-time 3d hand tracking from monocular rgb. In Proceedings of Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [26] Markus Oberweger and Vincent Lepetit. Deepprior++: Improving fast and accurate 3d hand pose estimation. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), 2018.
  • [27] Iason Oikonomidis, Nikolaos Kyriazis, and Antonis Argyros. Efficient model-based 3d tracking of hand articulations using kinect. In British Machine Vision Conference (BMVC), 2011.
  • [28] Iasonas Oikonomidis, Nikolaos Kyriazis, and Antonis A. Argyros. Markerless and efficient 26-dof hand pose recovery. In Asian Conference on Computer Vision (ACCV), 2010.
  • [29] Paschalis Panteleris and Antonis A. Argyros. Back to RGB: 3d tracking of hands and hand-object interactions based on short-baseline stereo. CoRR, abs/1705.05301, 2017.
  • [30] Thammathip Piumsomboon, Adrian Clark, Mark Billinghurst, and Andy Cockburn. User-defined gestures for augmented reality. In Paula Kotzé, Gary Marsden, Gitte Lindgaard, Janet Wesson, and Marco Winckler, editors, Human-Computer Interaction – INTERACT 2013, pages 282–299, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
  • [31] James M. Rehg and Takeo Kanade. Visual tracking of high dof articulated structures: An application to human hand tracking. In Proc of Third European Conference on Computer Vision, 1994.
  • [32] J. M. Rehg and T. Kanade. Digiteyes: vision-based hand tracking for human-computer interaction. In IEEE Workshop on Motion of Non-rigid & Articulated Objects, 2002.
  • [33] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Trans. Graph., 36(6):245:1–245:17, Nov. 2017.
  • [34] Adrian Spurr, Jie Song, Seonwook Park, and Otmar Hilliges. Cross-modal deep variational hand pose estimation. CoRR, abs/1803.11404, 2018.
  • [35] S. Sridhar, H. Rhodin, H. P. Seidel, A. Oulasvirta, and C. Theobalt. Real-time hand tracking using a sum of anisotropic gaussians model. In International Conference on 3d Vision, 2014.
  • [36] B. Stenger, A. Thayananthan, P. H. S. Torr, and R. Cipolla. Model-based hand tracking using a hierarchical bayesian filter. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1372–1384, Sep. 2006.
  • [37] Danhang Tang, Tsz Ho Yu, and Tae Kyun Kim. Real-time articulated hand pose estimation using semi-supervised transductive regression forests. In IEEE International Conference on Computer Vision, 2013.
  • [38] Anastasia Tkach, Mark Pauly, and Andrea Tagliasacchi. Sphere-meshes for real-time hand modeling and tracking. ACM Trans. Graph., 35(6):222:1–222:11, Nov. 2016.
  • [39] Bastian Wandt, Hanno Ackermann, and Bodo Rosenhahn. A kinematic chain space for monocular motion capture. arXiv preprint, 2017.
  • [40] Bastian Wandt and Bodo Rosenhahn. Repnet: Weakly supervised training of an adversarial reprojection network for 3d human pose estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [41] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, 2018.
  • [42] Xiaokun Wu, Daniel Finnegan, Eamonn O’Neill, and Yong-Liang Yang. Handmap: Robust hand pose estimation via intermediate dense guidance map supervision. In The European Conference on Computer Vision (ECCV), September 2018.
  • [43] Xiao Sun, Yichen Wei, Shuang Liang, Xiaoou Tang, and Jian Sun. Cascaded hand pose regression. In IEEE Conference on Computer Vision & Pattern Recognition, 2015.
  • [44] Linlin Yang and Angela Yao. Disentangling latent hands for image synthesis and pose estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [45] Wei Yang, Wanli Ouyang, Xiaolong Wang, Jimmy Ren, Hongsheng Li, and Xiaogang Wang. 3d human pose estimation in the wild by adversarial learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [46] Ying Wu and T. S. Huang. Hand modeling, analysis and recognition. IEEE Signal Processing Magazine, 18(3):51–60, 2002.
  • [47] Ying Wu, John Lin, and Thomas S. Huang. Analyzing and capturing articulated hand motion in image sequences. IEEE Transactions on Pattern Analysis & Machine Intelligence, 27(12):1910–1922, 2005.
  • [48] Shanxin Yuan, Guillermo Garcia-Hernando, Bjorn Stenger, Gyeongsik Moon, Ju Chang, Kyoung Lee, Pavlo Molchanov, Jan Kautz, Sina Honari, Liuhao Ge, Junsong Yuan, Xinghao Chen, Guijin Wang, Fan Yang, Kai Akiyama, Yang Wu, Qingfu Wan, Meysam Madadi, Sergio Escalera, and Tae-Kyun Kim. Depth-based 3d hand pose estimation: From current achievements to future goals. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2636–2645, June 2018.
  • [49] Jiawei Zhang, Jianbo Jiao, Mingliang Chen, Liangqiong Qu, Xiaobin Xu, and Qingxiong Yang. 3d hand pose tracking and estimation using stereo matching. ArXiv, abs/1610.07214, 2016.
  • [50] Jiawei Zhang, Jianbo Jiao, Mingliang Chen, Liangqiong Qu, Xiaobin Xu, and Qingxiong Yang. 3d hand pose tracking and estimation using stereo matching. ArXiv, abs/1610.07214, 2016.
  • [51] Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, and Dimitris N. Metaxas. Semantic graph convolutional networks for 3d human pose regression. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [52] Yidan Zhou, Jian Lu, Kuo Du, Xiangbo Lin, Yi Sun, and Xiaohong Ma. Hbe: Hand branch ensemble network for real-time 3d hand pose estimation. In The European Conference on Computer Vision (ECCV), September 2018.
  • [53] Christian Zimmermann and Thomas Brox. Learning to estimate 3d hand pose from single rgb images. In The IEEE International Conference on Computer Vision (ICCV), pages 4913–4921, Oct 2017.
  • [54] Christian Zimmermann and Thomas Brox. Learning to estimate 3d hand pose from single rgb images. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.