Capturing Hand Articulations using Recurrent Neural Network for 3D Hand Pose Estimation

11/18/2019 ∙ by Cheol-hwan Yoo, et al. ∙ Korea University

3D hand pose estimation from a single depth image plays an important role in computer vision and human-computer interaction. Although recent hand pose estimation methods using convolutional neural networks (CNNs) have shown notable improvements in accuracy, most of them rely on complex network structures without fully exploiting the articulated structure of the hand. A hand is an articulated object composed of six local parts: the palm and five independent fingers. Each finger consists of sequential joints with constrained motion, referred to as a kinematic chain. In this paper, we propose a hierarchically-structured convolutional recurrent neural network (HCRNN) with six branches that estimate the 3D positions of the palm and five fingers independently. The palm position is predicted via fully-connected layers. The positions of each finger's sequential joints are obtained using a recurrent neural network (RNN) to capture the spatial dependencies between adjacent joints. HCRNN directly takes the depth map as input without time-consuming data conversion, such as 3D voxels or point clouds. Experimental results on public datasets demonstrate that the proposed HCRNN not only outperforms 2D CNN-based methods that use depth images as inputs but also achieves results competitive with state-of-the-art 3D CNN-based methods, with a highly efficient running speed of 240 fps on a single GPU.


1 Introduction

Figure 1: Performance/inference-speed trade-off on the NYU dataset. The proposed HCRNN has advantages in both estimation accuracy and inference speed.
Figure 2: The overall architecture of the proposed method for 3D hand pose estimation. It is designed with a hierarchical structure, where each separate branch estimates the 3D positions of the palm and five fingers. In the finger branches, the RNN-based regression block is added to capture the spatial dependencies between the sequential-joints of the finger.

Accurate 3D hand pose estimation has received considerable attention regarding a wide range of applications, such as virtual/augmented reality and human-computer interaction [7]. As commercial depth cameras have been released and become more common, depth-based hand pose estimation methods have attracted significant research interest in recent decades [37]. Nevertheless, it is still a challenging problem to accurately estimate 3D hand pose, because of the low quality of depth images, large variations in hand orientations, high joint flexibility, and severe self-occlusion.

Recently, most 3D hand pose estimation methods have been based on convolutional neural networks (CNNs) applied to a single depth image. Among these CNN-based methods, there are two major approaches to improving estimation accuracy. The first involves 3D data representation of the 2D depth image. To utilize 3D spatial information, Ge et al. and Moon et al. converted the depth image into a volumetric representation, such as 3D voxels, and then applied a 3D CNN for 3D hand pose estimation [11, 12, 23]. In addition, 3D representations of the input based on point clouds have been proposed [4, 9, 13]. Although these methods are effective for capturing the geometric properties of depth images [11, 23], they suffer from large parameter counts and complex data-conversion processes, resulting in high time complexity. For efficient training and testing, 2D CNN-based methods that attempt to extract more information from 2D inputs are therefore still widely researched. In this study, we adopt the 2D depth image itself as input, without a further data representation process, to exploit the efficiency of 2D CNNs.

The second approach involves effective network architecture designs that utilize the structural properties of hands [6, 21, 30, 39]. Hierarchical networks that divide the global hand pose estimation problem into sub-tasks have been proposed, where each sub-task focuses on a specific finger or hand region. Sun et al. [30] divided global hand pose estimation into local estimations of palm and finger pose and then updated the finger locations according to the palm pose in a cascaded manner. Madadi et al. [21] designed a hierarchically tree-structured CNN with five branches modeling the five fingers and an additional branch modeling the palm orientation. Zhou et al. [39] designed a three-branch network, where the branches correspond to the thumb, the index finger, and the three other fingers, according to the differences in the functional importance of the fingers. More recently, Du et al. [6] proposed a two-branch cross-connection network that hierarchically regresses palm and finger poses through information sharing in a multi-task setup. These studies have demonstrated that handling different parts of the hand via a multi-branch CNN can improve the accuracy of 3D hand pose estimation. However, these methods estimate all joints of a finger directly, without explicitly considering finger kinematics. Within a finger, the movements of different joints are dependent on each other and can be represented as a kinematic chain. To better capture the spatial dependencies between adjacent joints, we adopt a recurrent neural network (RNN) to handle the sequential features of the joints in a finger.

Figure 1 depicts the inference time (ms) versus mean distance error (mm) of several state-of-the-art 3D hand pose estimation methods on the NYU dataset. As can be seen in the figure, in terms of estimation accuracy, the proposed method ranks third, behind V2V-PoseNet [23] and Point-to-Point [13], both of which have much higher computational complexity: the proposed HCRNN runs approximately 70 and 6 times faster than V2V-PoseNet [23] and Point-to-Point [13], respectively. In terms of inference speed, the proposed HCRNN ranks second behind Feedback [26], while improving on the estimation accuracy of Feedback by about 40%. In summary, HCRNN achieves both effectiveness and efficiency in 3D hand pose estimation.

Figure 2 illustrates the proposed hierarchical convolutional RNN (HCRNN), which takes the 3D geometry of a hand into account for 3D hand pose estimation. The six separate branches are based on the observation that the hand is composed of six local parts (i.e. the palm and five fingers) with different amounts of variations due to the articulated structure of the hand. Inspired by the recent study in which sequential features of a sequence-like object are obtained in a single image [28], we first extract the sequential features of joints by applying the combination of convolutional and fully-connected (FC) layers to the feature from the encoder network. Then, we propose to make use of an RNN taking these joint features as inputs to capture the sequential information of a finger and to extract the interdependent information of the finger joints along the kinematic chain.

Our contributions can be summarized as follows:

  • We propose the HCRNN architecture, which decomposes global hand pose estimation into sub-tasks of estimating the local parts of the hand. Based on the observation that the palm and fingers exhibit different flexibility and degrees of freedom, separate branches estimate the 3D positions of the palm and five fingers. We apply an RNN to utilize the spatial information between the sequential joints of each finger.

  • We design a relatively efficient network that not only achieves promising performance, with mean errors of 6.6 mm, 9.4 mm, and 7.8 mm on the ICVL [31], NYU [32], and MSRA [30] datasets, respectively, but also runs fast, at over 240 fps on a single GPU. The speed versus accuracy trade-off is shown in Figure 1.

2 Recurrent Neural Networks and Their Variants

RNNs learn a hidden representation for each time step of sequential data by considering both the current and previous information. Thanks to their ability to memorize and abstract sequential information over time, RNNs have achieved great success in sequential data modeling, such as natural language processing (NLP) and speech recognition. Formally, the hidden state $h_t$ and the output feature $y_t$ at the current time step $t$ can be respectively obtained by

$$h_t = \tanh(W_h x_t + U_h h_{t-1} + b_h), \tag{1}$$
$$y_t = W_y h_t + b_y, \tag{2}$$

where $W_h$, $U_h$, and $b_h$ are the parameters for the hidden state and $W_y$ and $b_y$ are the parameters for the current output. This recurrent structure allows the RNN to convey information from past time steps to the current prediction. However, as the time gap between pieces of information grows, the basic RNN cannot preserve temporal memories and faces the problem of long-term dependencies due to the vanishing gradient [18].
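
To make the recurrence concrete, the following is a minimal NumPy sketch of a single step of the basic RNN in Eqs. (1) and (2); the tanh activation and the parameter shapes are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_h, U_h, b_h, W_y, b_y):
    """One step of a basic RNN: hidden state (Eq. 1) and output feature (Eq. 2)."""
    h_t = np.tanh(W_h @ x_t + U_h @ h_prev + b_h)   # Eq. (1): current hidden state
    y_t = W_y @ h_t + b_y                           # Eq. (2): current output feature
    return y_t, h_t
```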

To tackle the aforementioned problem, long short-term memory (LSTM) [19] was proposed, which replaces the nonlinear units of the basic RNN. Among the numerous variants of LSTM, the gated recurrent unit (GRU) [5] is one of the most popular modules owing to its ability to reduce the complexity of LSTM. There are two key components in the GRU, referred to as the update gate and the reset gate. The update gate, $z_t$, controls the balance between the previous and current feature information, while the reset gate, $r_t$, is used to modulate the states of the previous hidden feature. At time step $t$, the current gate output $y_t$ is computed as follows:

$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z), \tag{3}$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r), \tag{4}$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h), \tag{5}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t, \tag{6}$$
$$y_t = W_y h_t + b_y, \tag{7}$$

where the $W$, $U$, and $b$ terms with a specific subscript represent the parameters of each layer, $\sigma$ is the sigmoid function, $\tilde{h}_t$ is the current memory content, $h_t$ is the current hidden state, and $\odot$ represents element-wise multiplication. In this work, we adopt the GRU as the basic RNN module.
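
As an illustration of the gate interactions, here is a minimal NumPy sketch of one GRU step following Eqs. (3)-(7); the parameter layout is an assumption chosen for readability and is not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps gate names to (W, U, b) parameter tuples."""
    W_z, U_z, b_z = p["z"]                                      # update gate parameters
    W_r, U_r, b_r = p["r"]                                      # reset gate parameters
    W_h, U_h, b_h = p["h"]                                      # candidate memory parameters
    W_y, b_y = p["y"]                                           # output parameters

    z_t = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)               # Eq. (3): update gate
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)               # Eq. (4): reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)   # Eq. (5): memory content
    h_t = (1.0 - z_t) * h_prev + z_t * h_tilde                  # Eq. (6): hidden state
    y_t = W_y @ h_t + b_y                                       # Eq. (7): gate output
    return y_t, h_t
```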

3 Proposed Method

3.1 Overall Network Architecture

Figure 2 illustrates the overall architecture of the proposed 3D hand pose estimation method. The proposed network mainly consists of two parts: an encoder that transforms an input depth image into an abstracted feature space, and joint regression sub-networks (SubNets) composed of six branches corresponding to the five fingers and the palm. The input depth image is first fed into the encoder for low-level feature extraction. The regression SubNets then take the obtained feature map from the encoder as input and predict the 3D pose of the palm and fingers.

| Layers | Kernel size | # channels | Output size |
|---|---|---|---|
| Residual block | – | 64 | 96 × 96 |
| Max pooling | – | 64 | 48 × 48 |
| Residual block | – | 64 | 48 × 48 |
| Max pooling | – | 64 | 24 × 24 |
| Residual block | – | 128 | 24 × 24 |
| Max pooling | – | 128 | 12 × 12 |
| Residual block | – | 256 | 12 × 12 |
| Residual block | – | 256 | 12 × 12 |

Table 1: Detailed architecture of the encoder network for initial feature extraction.

3.2 Encoder Network

The encoder of the proposed 3D hand pose estimation method is based on ResNet [17], as described in Table 1. The encoder has five residual blocks, each of which consists of two convolutional layers. When the output dimension of a residual block is increased, we use a convolutional layer instead of the identity skip connection. Max-pooling layers for down-sampling are appended after each residual block except for the last two. We design the encoder to be shallow because an input depth map has plainer texture than the inputs used in classification or segmentation tasks. Unless otherwise noted, the encoder takes an input with a spatial size of 96 × 96, and thus the output feature map has a spatial size of 12 × 12 with 256 channels.
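
A minimal tf.keras sketch of the encoder in Table 1 is given below. The 3×3 kernels, the 1×1 projection on channel-increasing skip connections, and the batch-normalization placement are assumptions; only the block order, channel counts, and pooling positions follow Table 1.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, channels):
    shortcut = x
    if x.shape[-1] != channels:
        # Projection conv on the skip path when the channel count increases (assumed 1x1).
        shortcut = layers.Conv2D(channels, 1, padding="same")(x)
    y = layers.Conv2D(channels, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(channels, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(layers.Add()([y, shortcut]))

def build_encoder():
    inp = tf.keras.Input(shape=(96, 96, 1))   # cropped, normalized depth map
    x = residual_block(inp, 64)               # 96 x 96
    x = layers.MaxPool2D()(x)                 # 48 x 48
    x = residual_block(x, 64)
    x = layers.MaxPool2D()(x)                 # 24 x 24
    x = residual_block(x, 128)
    x = layers.MaxPool2D()(x)                 # 12 x 12
    x = residual_block(x, 256)
    x = residual_block(x, 256)                # 12 x 12 x 256 feature map
    return tf.keras.Model(inp, x)
```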

3.3 Regression SubNet

As the different parts of the hand have different amounts of variation and degrees of freedom (DoF) due to its articulated structure [30], it is inefficient to directly regress all parts together from the encoded feature. Among the hand joints, the palm is much more stable than the fingers and has the highest DoF, which mainly affects the global hand position. The five fingers are largely independent and have more flexibility than the palm [30]. By decomposing the global hand pose estimation problem into sub-tasks of palm and finger pose estimation, the optimization of the network parameters can be simplified by limiting the search space [8, 29]. Based on these properties of the hand, we design our joint regression SubNet as a hierarchically structured network with six separate branches for the estimation of the palm and five fingers: thumb, index, middle, ring, and little finger.

3.3.1 Sequential modeling of finger joints

Although recent state-of-the-art 3D hand pose estimation methods mostly adopt discriminative learning as deep learning technology advances, model-based methods still have their own advantages [35, 36, 38]. Considering the joint connections along a finger, the movements of its different joints are highly related to each other. With the finger root, i.e. the metacarpophalangeal (MCP) joint, as the base position, each finger is composed of sequential joints, which can be represented as a kinematic chain. Using the kinematic structure of the hand, model-based methods can constrain the solution space of the hand joint positions. To take advantage of both the model-based and discriminative-learning-based methods, we estimate the $k$th joint of the $i$th finger, $J^i_k$, by using the feature sequence of the previous links as follows:

$$J^i_k = \mathcal{F}(f^i_1, f^i_2, \dots, f^i_k), \tag{8}$$

where $\mathcal{F}$ is a mapping function from the feature sequence onto the 3D coordinate and $f^i_j$, $j = 1, \dots, k$, is the abstracted feature of the $i$th finger's $j$th joint. Note that $f^i_1$ represents the feature of the MCP joint. Reflecting the kinematic structure of the finger, we design a regression sub-network (SubNet) for (8) by using a recurrent model in which the hidden layer containing the information of the previous links controls the current estimation as follows:

$$h^i_k = \mathrm{GRU}(f^i_k, h^i_{k-1}; \theta), \tag{9}$$
$$J^i_k = W h^i_k + b, \tag{10}$$

where $h^i_k$ and $h^i_{k-1}$ are the current and previous hidden states, respectively, $\mathrm{GRU}(\cdot)$ denotes the GRU, $\theta$ is the parameter of the GRU, and $W$ and $b$ are the parameters of the linear regression.

Figure 3: Architecture of the RNN-based regression block, which takes a feature sequence of a finger as input. It conveys the feature information of the spatially adjacent joints, which are utilized for estimating the position of sequential finger joints.

3.3.2 Finger branch

Figure 3 depicts the RNN-based regression block in the finger branch. For a finger, each 3D joint position is recurrently estimated by using its encoded input feature and the previous joint information in the latent space. Inspired by the recent study in which sequential features of a sequence-like object are obtained from a single image [28], we first extract the sequential features of the joints from the output of the encoder network. To this end, we apply a convolution followed by global pooling to the feature map from the encoder for each branch in order to extract the hand-part-specific information. Let $x^i$ denote such a feature vector of the $i$th finger branch. Taking $x^i$ as input, different FC layers are employed to obtain the feature vectors of each joint, $f^i_k$, $k = 1, \dots, K$, where $K$ is the number of joints in the finger, as shown in Figure 3. To estimate the first joint of the finger, i.e. the MCP joint, $f^i_1$ and a zero vector as the initial hidden state are fed into the GRU as in (9) and (10). The hidden state of the GRU then conveys the sequential information to the end joint along the kinematic chain. Finally, the estimated 3D coordinates of the joints $J^i_k$ for $k = 1, \dots, K$ are concatenated to form the 3D pose of the $i$th finger, $J^i$.
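
The following is a minimal tf.keras sketch of a single finger branch as described above. The 1×1 convolution, feature dimensions, and hidden size are assumptions; the structure (hand-part-specific feature, per-joint FC features, and a GRU that regresses each 3D joint along the kinematic chain, Eqs. (8)-(10)) follows the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

class FingerBranch(tf.keras.layers.Layer):
    def __init__(self, num_joints=3, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.part_conv = layers.Conv2D(feat_dim, 1, activation="relu")   # assumed 1x1 conv
        self.pool = layers.GlobalAveragePooling2D()
        self.joint_fcs = [layers.Dense(feat_dim, activation="relu")      # one FC per joint
                          for _ in range(num_joints)]
        self.gru_cell = layers.GRUCell(hidden_dim)
        self.regress = layers.Dense(3)                                    # linear 3D regression
        self.hidden_dim = hidden_dim

    def call(self, encoder_feat):
        # Hand-part-specific feature for this finger.
        x = self.pool(self.part_conv(encoder_feat))
        h = tf.zeros([tf.shape(x)[0], self.hidden_dim])   # zero initial hidden state
        joints = []
        for fc in self.joint_fcs:                          # MCP -> fingertip
            f_k = fc(x)                                    # joint feature f_k
            h, _ = self.gru_cell(f_k, [h])                 # Eq. (9): GRU step
            joints.append(self.regress(h))                 # Eq. (10): 3D joint coordinate
        return tf.concat(joints, axis=-1)                  # 3D pose of the finger
```

For example, applying `FingerBranch(num_joints=3)` to a `(batch, 12, 12, 256)` encoder feature map returns a `(batch, 9)` tensor holding the three 3D joints of that finger.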

3.3.3 Palm branch

Like the finger branches, the hand-part-specific feature for the palm branch is first extracted. As mentioned at the beginning of this section, the palm is less flexible and more stable than the fingers. Thus, the palm joint is directly estimated by applying a series of FC layers to the palm feature.
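
As a complement to the finger-branch sketch above, the following is a minimal tf.keras sketch of the palm branch and of how the six branch outputs could be assembled into the full hand pose; the number and size of the FC layers and the assembly step are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

class PalmBranch(tf.keras.layers.Layer):
    """Direct FC regression of the palm joint(s); no recurrence is used here."""
    def __init__(self, num_palm_joints=1, feat_dim=512):
        super().__init__()
        self.part_conv = layers.Conv2D(feat_dim, 1, activation="relu")
        self.pool = layers.GlobalAveragePooling2D()
        self.fc = layers.Dense(feat_dim, activation="relu")
        self.out = layers.Dense(3 * num_palm_joints)     # 3D coordinates of the palm joint(s)

    def call(self, encoder_feat):
        x = self.pool(self.part_conv(encoder_feat))      # hand-part-specific palm feature
        return self.out(self.fc(x))

# Illustrative assembly: one palm branch plus five finger branches, all consuming the
# same encoder feature map, concatenated into a single pose vector.
def hand_pose(encoder_feat, palm_branch, finger_branches):
    parts = [palm_branch(encoder_feat)] + [fb(encoder_feat) for fb in finger_branches]
    return tf.concat(parts, axis=-1)
```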

3.4 Loss Functions

We train the proposed network in an end-to-end manner by minimizing the smooth $L_1$ loss [14], defined as

$$L = \sum_{i} \sum_{k} \mathrm{smooth}_{L_1}\!\left(J^{i*}_k - J^i_k\right), \tag{11}$$

where

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,\|x\|^2, & \|x\| < 1 \\ \|x\| - 0.5, & \text{otherwise,} \end{cases} \tag{12}$$

and $J^{i*}_k$ and $J^i_k$ are the ground-truth and estimated 3D coordinates of the $k$th joint from the $i$th branch. The smooth $L_1$ loss is more robust to outliers than the $L_2$ loss and is widely used in regression problems such as object detection [14, 16].
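
For reference, a minimal TensorFlow sketch of the smooth L1 loss in Eqs. (11) and (12) is given below; summing over joints and averaging over the batch is an assumed reduction, not necessarily the one used in the paper.

```python
import tensorflow as tf

def smooth_l1_loss(gt_joints, pred_joints):
    """gt_joints, pred_joints: (batch, num_joints, 3) tensors of 3D coordinates."""
    diff = tf.abs(gt_joints - pred_joints)
    per_elem = tf.where(diff < 1.0,
                        0.5 * tf.square(diff),   # Eq. (12): quadratic for small errors
                        diff - 0.5)              # Eq. (12): linear for large errors
    # Eq. (11): sum over joints and coordinates, then average over the batch (assumed).
    return tf.reduce_mean(tf.reduce_sum(per_elem, axis=[1, 2]))
```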

Figure 4: Subset of joints on the MSRA, NYU and ICVL datasets. Violet color indicates the palm joints subset, and other colors indicate the finger joints subset.

3.5 Implementation Details

As mentioned in the previous subsection, the input feature vectors of each branch are obtained by convolutional layers whose output channel number is set to 512. We add a batch normalization (BN) layer after each convolutional layer to simplify the learning procedure and improve generalization. The rectified linear unit (ReLU) is employed as the activation function after the convolutional and FC layers, except for the last FC layer, which performs the final joint regression. Following the strategies in [3, 15, 24, 26], we first extract a fixed-size cube from the depth image around the hand. The hand region is cropped from this bounding area and resized to a fixed size of 96 × 96, matching the encoder input in Table 1. The depth values within the cropped region are normalized to [-1, 1], and points whose depth falls outside the range of the cube are assigned a depth of 1. During training, we apply the following online data augmentation: random rotation in the range [-180, 180] degrees, random translation of [-10, 10] pixels, and random scaling of [0.9, 1.1]. We use the Adam optimizer [20] with an initial learning rate of 1e-3, a batch size of 32, and a weight decay of 1e-5. The learning rate is decayed by a factor of 0.96 every 2k iterations. The entire network is trained for 120 epochs in an end-to-end manner. Our model is implemented in TensorFlow [1], and a single NVIDIA Titan X GPU (Pascal architecture) is used for training and testing. To design the branch details for different hand pose datasets, we define the joint subsets as shown in Figure 4.
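
A minimal sketch of the depth preprocessing described above is given below; the cube size (300 mm per side), the use of OpenCV for resizing, and the handling of invalid or near readings are assumptions for illustration.

```python
import cv2
import numpy as np

def preprocess_hand_crop(depth_crop, center_depth, cube_half=150.0, out_size=96):
    """depth_crop: cropped hand depth patch in mm; center_depth: hand-center depth in mm."""
    patch = cv2.resize(depth_crop.astype(np.float32), (out_size, out_size))
    patch = (patch - center_depth) / cube_half   # normalize the cube depth range to [-1, 1]
    patch[patch > 1.0] = 1.0                     # points behind the cube -> 1 (background)
    patch[patch < -1.0] = 1.0                    # assumed: invalid/near readings also set to 1
    return patch
```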

Figure 5: Baseline architectures for self-comparisons. a) Two-branch baseline network with one branch for palm and the other branch for finger joint regression. b) Baseline network where the RNN-based regression block is replaced with fully connected layers.
Figure 6: Self-comparison results on 3D distance errors (mm) per hand joint.

4 Experimental Results

4.1 Datasets and Evaluation Metrics

We evaluated our network on the following three public hand pose datasets: ICVL [31], NYU [32], and MSRA [30].

4.1.1 MSRA dataset

The MSRA dataset [30] contains 76k frames from nine different subjects performing 17 different gestures. This dataset was captured with Intel's Creative Interactive Gesture Camera [22] and has 21 annotated joints, including one palm joint and four joints per finger, as shown in Figure 4(a). Following the most commonly used protocol [30], we used a leave-one-subject-out cross-validation strategy for evaluation on this dataset.

4.1.2 NYU dataset

The NYU dataset [32] was captured with three Microsoft Kinects. It contains 72k training and 8k testing images from three different views. The training set was collected from one subject, while the testing set was collected from two subjects. Following the evaluation protocol of most previous works, we used only the frontal view and a subset of 14 annotated joints, depicted in Figure 4(b), for both training and testing.

4.1.3 ICVL dataset

The ICVL dataset [31] was captured with an Intel RealSense camera. It contains 22k training frames from 10 different subjects and 1.5k testing images. The training set includes an additional 300k augmented frames with in-plane rotations, but we did not use them because we applied online data augmentation during training, as described in Section 3.5. This dataset has 16 annotated joints, including one palm joint and three joints per finger, as shown in Figure 4(c).

4.1.4 Evaluation metrics

To evaluate the performance of the different 3D hand pose estimation methods, we used two metrics. The first is the average 3D distance error between the ground-truth and predicted 3D positions of each joint. The second is the percentage of successful frames, i.e. frames in which the errors of all joints are within a given threshold.

| Method | ICVL (mm) | NYU (mm) | MSRA (mm) | Input |
|---|---|---|---|---|
| Multi-view CNNs [10] | – | – | 13.1 | 2D |
| DISCO [2] | – | 20.7 | – | 2D |
| DeepPrior [25] | 10.4 | 19.73 | – | 2D |
| Feedback [26] | – | 15.97 | – | 2D |
| Global2Local [21] | – | 15.60 | 12.8 | 2D |
| CrossingNets [33] | 10.2 | 15.5 | 12.2 | 2D |
| HBE [39] | 8.62 | – | – | 2D |
| REN (4x6x6) [15] | 7.63 | 13.39 | – | 2D |
| REN (9x6x6) [34] | 7.31 | 12.69 | 9.79 | 2D |
| DeepPrior++ [24] | 8.1 | 12.24 | 9.5 | 2D |
| Pose-REN [3] | 6.79 | 11.81 | 8.65 | 2D |
| Generalized [27] | – | 10.89 | – | 2D |
| CrossInfoNet [6] | 6.73 | 10.07 | 7.86 | 2D |
| HCRNN (Ours) | 6.58 | 9.41 | 7.77 | 2D |
| 3D CNN [11] | – | 14.1 | 9.58 | 3D |
| SHPR-Net [4] | 7.22 | 10.78 | 7.96 | 3D |
| 3D DenseNet [12] | 6.7 | 10.6 | 7.9 | 3D |
| Hand PointNet [9] | 6.94 | 10.5 | 8.5 | 3D |
| Point-to-Point [13] | 6.33 | 9.04 | 7.71 | 3D |
| V2V-PoseNet [23] | 6.28 | 8.42 | 7.49 | 3D |

Table 2: Comparison of the proposed method with state-of-the-art methods on three 3D hand pose datasets. Mean error (mm) indicates the average 3D distance error.
Figure 7: Comparison with state-of-the-art methods. Top row: percentage of success frames over different error thresholds. Bottom row: 3D distance errors per hand joint. Left: ICVL dataset, Center: NYU dataset, Right: MSRA dataset.

4.2 Self-comparisons

To analyze the contribution of each component of the proposed method, we conducted self-comparison experiments on the ICVL [31] dataset. First, we evaluated the effect of independent finger joint estimation. For this experiment, as shown in Figure 5(a), we designed a two-branch network consisting of one branch for palm joint regression and another for unified finger joint regression, similar to the approach in [6]. In other words, in the finger branch of the two-branch network, the RNN-based regression block recurrently estimates the joints of all five fingers together. For a fair comparison with the proposed architecture, we adjusted the number of convolution channels and the dimensions of the FC layers in the finger branch of the two-branch network so that each network has a similar number of parameters. As shown in Figure 6, the proposed network architecture performs better, reducing the mean 3D distance error by 0.41 mm (from 6.99 to 6.58; see the last entry of the graph) compared with the baseline architecture. This result demonstrates that regressing all fingers jointly is inefficient because the five fingers are largely independent [30]. By building a fine-grained branch for each finger, the network can learn richer features for finger pose estimation.

We also evaluated the effect of the RNN-based regression on finger joint estimation. We designed another network architecture, where the RNN-based regression block is replaced with two FC layers, as shown in Figure 5(b). The FC-layer-based network directly regresses all joints of the finger from the input finger features rather than sequentially estimating finger joints along the kinematic chain. As shown in Figure 6, the proposed RNN-based network architecture achieves a better result than the FC-layer-based network with direct regression and reduces the mean 3D distance error (mm) by 0.37 (from 6.95 to 6.58). These experiments confirm that the proposed RNN-based sequential regression can effectively estimate the spatially related finger joints along the kinematic chain.

Figure 8: Comparison of mean error distance over different yaw (left) and pitch (right) viewpoint angles on MSRA dataset.
Figure 9: Qualitative results on the three public datasets. Left: ICVL dataset. Center: NYU dataset. Right: MSRA dataset. The ground truth is shown as red lines, and the prediction is shown as green lines.
| Method | Test speed (fps) | Input |
|---|---|---|
| V2V-PoseNet [23] | 3.5 | 3D |
| Point-to-Point [13] | 41.8 | 3D |
| HandPointNet [9] | 48 | 3D |
| 3D DenseNet [12] | 126 | 3D |
| 3D CNN [11] | 215 | 3D |
| DeepPrior++ [24] | 30 | 2D |
| Generalized [27] | 40 | 2D |
| CrossingNets [33] | 90.9 | 2D |
| CrossInfoNet [6] | 124.5 | 2D |
| Feedback [26] | 400 | 2D |
| HCRNN (ours) | 240 | 2D |

Table 3: Comparison of inference speed with state-of-the-art methods. The inference speed is measured on a single GPU.

4.3 Comparison with State-of-the-art Methods

We compared the proposed network on the three public 3D hand pose datasets with recently proposed methods using 2D depth maps as input, including DISCO [2], DeepPrior [25] and its improved version DeepPrior++ [24], Feedback [26], Multi-view CNNs [10], REN-4x6x6 [15], REN-9x6x6 [34], Pose-REN [3], Generalized [27], Global2Local [21], CrossingNets [33], HBE [39], and CrossInfoNet [6], as well as methods using 3D inputs, including 3D CNN [11], SHPR-Net [4], 3D DenseNet [12], HandPointNet [9], Point-to-Point [13], and V2V-PoseNet [23]. The average 3D distance error per joint and the percentage of successful frames over different error thresholds are shown in Table 2 and Figure 7, respectively. As can be seen, our method outperforms the state-of-the-art methods with 2D inputs on all three datasets. Compared with the methods using 3D inputs, our method performs better than 3D CNN [11], SHPR-Net [4], HandPointNet [9], and 3D DenseNet [12], and achieves performance comparable to Point-to-Point [13] on the ICVL and MSRA datasets. On the NYU dataset, the results of the proposed method are worse than those of V2V-PoseNet [23] but better in terms of the percentage of successful frames when the error threshold is larger than 30 mm. On the MSRA dataset, following the evaluation protocol of prior works [3, 11, 30], we also measured the mean joint error over various viewpoint angles. As shown in Figure 8, our method obtains promising results even at large yaw and pitch angles, which demonstrates its robustness to viewpoint changes, a challenging problem in hand pose estimation. The qualitative results of our method on the three datasets are shown in Figure 9. It can be seen that our method accurately estimates 3D hand joint locations on all three datasets.

4.4 Runtime

The inference speed is also an important factor for the practical application of 3D hand pose estimation. Table 3 compares the inference speed of conventional methods and the proposed method on a single GPU. While the top-ranked methods using 3D inputs have higher inference times owing to time-consuming 3D convolution operations or data-conversion procedures, our method achieves a faster inference speed owing to the efficiency of its 2D CNN-based architecture. The proposed HCRNN ranks second among the compared methods, behind Feedback. Based on these results, our proposed HCRNN not only achieves competitive performance compared with state-of-the-art methods but is also very efficient, with a high frame rate, demonstrating its applicability to real-time applications.

5 Conclusion

To design a practical architecture for 3D hand pose estimation, we considered the articulated structure of the hand and proposed an efficient regression network termed HCRNN. The proposed HCRNN has a hierarchical architecture in which six separate branches are trained to estimate the position of each local part of the hand: the palm and five fingers. In each finger branch, we adopted an RNN to model the spatial dependencies between the sequential joints of the finger. In addition, HCRNN is built on a 2D CNN that directly takes 2D depth images as input, making it more efficient than 3D CNN-based methods. The experimental results showed that the proposed HCRNN not only achieves competitive performance compared with state-of-the-art methods but also runs at a highly efficient speed of 240 fps on a single GPU.

References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §3.5.
  • [2] D. Bouchacourt, P. K. Mudigonda, and S. Nowozin (2016) Disco nets: dissimilarity coefficients networks. In Advances in Neural Information Processing Systems, pp. 352–360. Cited by: §4.3, Table 2.
  • [3] X. Chen, G. Wang, H. Guo, and C. Zhang (2019) Pose guided structured region ensemble network for cascaded hand pose estimation. Neurocomputing. Cited by: §3.5, §4.3, Table 2.
  • [4] X. Chen, G. Wang, C. Zhang, T. Kim, and X. Ji (2018) Shpr-net: deep semantic hand pose regression from point clouds. IEEE Access 6, pp. 43425–43439. Cited by: §1, §4.3, Table 2.
  • [5] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §2.
  • [6] K. Du, X. Lin, Y. Sun, and X. Ma (2019) CrossInfoNet: multi-task information sharing based hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9896–9905. Cited by: §1, §4.2, §4.3, Table 2, Table 3.
  • [7] A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, and X. Twombly (2007) Vision-based hand pose estimation: a review. Computer Vision and Image Understanding 108 (1-2), pp. 52–73. Cited by: §1.
  • [8] V. Ferrari, M. Marin-Jimenez, and A. Zisserman (2008) Progressive search space reduction for human pose estimation. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §3.3.
  • [9] L. Ge, Y. Cai, J. Weng, and J. Yuan (2018) Hand pointnet: 3d hand pose estimation using point sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8417–8426. Cited by: §1, §4.3, Table 2, Table 3.
  • [10] L. Ge, H. Liang, J. Yuan, and D. Thalmann (2016) Robust 3d hand pose estimation in single depth images: from single-view cnn to multi-view cnns. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3593–3601. Cited by: §4.3, Table 2.
  • [11] L. Ge, H. Liang, J. Yuan, and D. Thalmann (2017) 3d convolutional neural networks for efficient and robust hand pose estimation from single depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1991–2000. Cited by: §1, §4.3, Table 2, Table 3.
  • [12] L. Ge, H. Liang, J. Yuan, and D. Thalmann (2018) Real-time 3d hand pose estimation with 3d convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence 41 (4), pp. 956–970. Cited by: §1, §4.3, Table 2, Table 3.
  • [13] L. Ge, Z. Ren, and J. Yuan (2018) Point-to-point regression pointnet for 3d hand pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 475–491. Cited by: §1, §1, §4.3, Table 2, Table 3.
  • [14] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §3.4.
  • [15] H. Guo, G. Wang, X. Chen, C. Zhang, F. Qiao, and H. Yang (2017) Region ensemble network: improving convolutional network for hand pose estimation. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 4512–4516. Cited by: §3.5, §4.3, Table 2.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence 37 (9), pp. 1904–1916. Cited by: §3.4.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2.
  • [18] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, et al. (2001) Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A field guide to dynamical recurrent neural networks. IEEE Press. Cited by: §2.
  • [19] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.
  • [20] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.5.
  • [21] M. Madadi, S. Escalera, X. Baró, and J. Gonzalez (2017) End-to-end global to local cnn learning for hand pose recovery in depth data. arXiv preprint arXiv:1705.09606. Cited by: §1, §4.3, Table 2.
  • [22] S. Melax, L. Keselman, and S. Orsten (2013) Dynamics based 3d skeletal hand tracking. In Proceedings of Graphics Interface 2013, pp. 63–70. Cited by: §4.1.1.
  • [23] G. Moon, J. Yong Chang, and K. Mu Lee (2018) V2v-posenet: voxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5079–5088. Cited by: §1, §1, §4.3, Table 2, Table 3.
  • [24] M. Oberweger and V. Lepetit (2017) Deepprior++: improving fast and accurate 3d hand pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 585–594. Cited by: §3.5, §4.3, Table 2, Table 3.
  • [25] M. Oberweger, P. Wohlhart, and V. Lepetit (2015) Hands deep in deep learning for hand pose estimation. arXiv preprint arXiv:1502.06807. Cited by: §4.3, Table 2.
  • [26] M. Oberweger, P. Wohlhart, and V. Lepetit (2015) Training a feedback loop for hand pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3316–3324. Cited by: §1, §3.5, §4.3, Table 2, Table 3.
  • [27] M. Oberweger, P. Wohlhart, and V. Lepetit (2019) Generalized feedback loop for joint hand-object pose estimation. IEEE transactions on pattern analysis and machine intelligence. Cited by: §4.3, Table 2, Table 3.
  • [28] B. Shi, X. Bai, and C. Yao (2016) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence 39 (11), pp. 2298–2304. Cited by: §1, §3.3.2.
  • [29] A. Sinha, C. Choi, and K. Ramani (2016) Deephand: robust hand pose estimation by completing a matrix imputed with deep features. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4150–4158. Cited by: §3.3.
  • [30] X. Sun, Y. Wei, S. Liang, X. Tang, and J. Sun (2015) Cascaded hand pose regression. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 824–832. Cited by: 2nd item, §1, §3.3, §4.1.1, §4.1, §4.2, §4.3.
  • [31] D. Tang, H. Jin Chang, A. Tejani, and T. Kim (2014) Latent regression forest: structured estimation of 3d articulated hand posture. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3786–3793. Cited by: 2nd item, §4.1.3, §4.1, §4.2.
  • [32] J. Tompson, M. Stein, Y. Lecun, and K. Perlin (2014) Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics (ToG) 33 (5), pp. 169. Cited by: 2nd item, §4.1.2, §4.1.
  • [33] C. Wan, T. Probst, L. Van Gool, and A. Yao (2017) Crossing nets: combining gans and vaes with a shared latent space for hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 680–689. Cited by: §4.3, Table 2, Table 3.
  • [34] G. Wang, X. Chen, H. Guo, and C. Zhang (2018) Region ensemble network: towards good practices for deep 3d hand pose estimation. Journal of Visual Communication and Image Representation 55, pp. 404–414. Cited by: §4.3, Table 2.
  • [35] J. Wöhlke, S. Li, and D. Lee (2018) Model-based hand pose estimation for generalized hand shape with appearance normalization. arXiv preprint arXiv:1807.00898. Cited by: §3.3.1.
  • [36] Y. Wu, J. Y. Lin, and T. S. Huang (2001) Capturing natural hand articulation. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, Vol. 2, pp. 426–432. Cited by: §3.3.1.
  • [37] S. Yuan, G. Garcia-Hernando, B. Stenger, G. Moon, J. Yong Chang, K. Mu Lee, P. Molchanov, J. Kautz, S. Honari, L. Ge, et al. (2018) Depth-based 3d hand pose estimation: from current achievements to future goals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2636–2645. Cited by: §1.
  • [38] X. Zhou, Q. Wan, W. Zhang, X. Xue, and Y. Wei (2016) Model-based deep hand pose estimation. arXiv preprint arXiv:1606.06854. Cited by: §3.3.1.
  • [39] Y. Zhou, J. Lu, K. Du, X. Lin, Y. Sun, and X. Ma (2018) Hbe: hand branch ensemble network for real-time 3d hand pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 501–516. Cited by: §1, §4.3, Table 2.