Accurate 3D hand pose estimation has received considerable attention owing to its wide range of applications, such as virtual/augmented reality and human-computer interaction. As commercial depth cameras have been released and become more common, depth-based hand pose estimation methods have attracted significant research interest in recent decades. Nevertheless, accurately estimating 3D hand pose remains challenging because of the low quality of depth images, large variations in hand orientation, high joint flexibility, and severe self-occlusion.
Recently, most 3D hand pose estimation methods have been based on convolutional neural networks (CNNs) with a single depth image. In these conventional CNN-based methods, there are two major approaches to improving estimation accuracy. The first involves 3D data representation of 2D depth images. To utilize 3D spatial information, Ge et al. and Moon et al. converted a depth image into a volumetric representation, such as 3D voxels, and then applied a 3D CNN for 3D hand pose estimation [11, 12, 23]. In addition, 3D representations of inputs based on a 3D point cloud have been proposed [4, 9, 13]. Although these methods are effective for capturing the geometric properties of depth images [11, 23], they suffer from large numbers of parameters and complex data-conversion processes, resulting in high time complexity. For efficient training and testing, 2D CNN-based methods that attempt to extract more information from 2D inputs are still widely researched. In this study, we adopt the 2D depth image itself as input, without a further data-representation process, to exploit the efficiency of 2D CNNs.
The second approach involves effective network architecture designs that utilize the structural properties of hands [6, 21, 30, 39]. Hierarchical networks that divide the global hand pose estimation problem into sub-tasks have been proposed, where each sub-task focuses on a specific finger or hand region. Sun et al. divided global hand pose estimation into local estimations of palm and finger poses and then updated the finger locations according to the palm pose in a cascaded manner. Madadi et al. designed a hierarchical tree-like structured CNN using five branches to model the five fingers and an additional branch to model the palm orientation. Zhou et al. designed a three-branch network, where the three branches correspond to the thumb, the index finger, and the three other fingers, according to the differences in the functional importance of different fingers. More recently, Du et al. proposed a two-branch cross-connection network that hierarchically regresses palm and finger poses through information-sharing in a multi-task setup. These studies have demonstrated that handling different parts of the hand via a multi-branch CNN can improve the accuracy of 3D hand pose estimation. However, these methods estimate all joints of a finger directly without explicitly considering finger kinematics. For a finger, the movements of different joints are dependent on each other and can be represented as a kinematic chain. To better capture spatial dependencies between adjacent joints, in our work, we adopt a recurrent neural network (RNN), which is mainly used to handle the sequential features of the joints in a finger.
Figure 1 depicts the inference time (ms) versus mean distance error (mm) graph of some state-of-the-art 3D hand pose estimation methods on the NYU dataset. As can be seen in the figure, in terms of estimation accuracy, the proposed method ranks third behind V2V-PoseNet  and Point-to-Point , which have much higher computational complexity than the proposed method. The proposed HCRNN operates approximately 70 times and six times faster than V2V-PoseNet  and Point-to-Point , respectively. On the other hand, in terms of inference speed, the proposed HCRNN ranks second behind Feedback , while improving the estimation accuracy of Feedback by about 40%. In summary, the HCRNN achieves both goals, effectiveness and efficiency, in 3D hand pose estimation.
Figure 2 illustrates the proposed hierarchical convolutional RNN (HCRNN), which takes the 3D geometry of a hand into account for 3D hand pose estimation. The six separate branches are based on the observation that the hand is composed of six local parts (i.e., the palm and five fingers) with different amounts of variation due to the articulated structure of the hand. Inspired by the recent study in which sequential features of a sequence-like object are obtained from a single image , we first extract the sequential features of joints by applying a combination of convolutional and fully-connected (FC) layers to the feature from the encoder network. Then, we propose an RNN that takes these joint features as inputs to capture the sequential information of a finger and to extract the interdependent information of the finger joints along the kinematic chain.
Our contributions can be summarized as follows:
We propose the HCRNN architecture that decomposes global hand estimation into sub-tasks of estimating the local parts of the hand. Based on the understanding that the palm and finger exhibit different flexibilities and degrees of freedom, the separate branches estimate the 3D positions of the palm and five fingers. We apply the RNN to utilize spatial information between the sequential-joints of the finger.
We design a relatively efficient network that not only achieves promising performance with the mean errors of 6.6, 9.4, and 7.8 on the ICVL , NYU , and MSRA  datasets, respectively, but also runs fast, at over 240 fps, on a single GPU. The speed versus accuracy trade-off graph can be seen in Figure 1.
2 Recurrent Neural Networks and Their Variants
RNNs learn a hidden representation for each time step of sequential data by considering both the current and previous information. Thanks to their ability to memorize and abstract sequential information over time, RNNs have achieved great success in sequential data modeling, such as natural language processing (NLP) and speech recognition. Formally, the hidden state and the output feature at the current time step, $t$, can be respectively obtained by

$h_t = \tanh(W_h x_t + U_h h_{t-1} + b_h),$

$y_t = W_y h_t + b_y,$

where $W_h$, $U_h$, and $b_h$ are the parameters for the hidden state, and $W_y$ and $b_y$ are the parameters for the current output. This recurrent structure allows the RNN to convey the information from past time steps to the current prediction process. However, as the time gap between pieces of information grows, the basic RNN cannot preserve temporal memories and faces the problem of long-term dependencies due to the vanishing gradient .
To tackle the aforementioned problem, long short-term memory (LSTM)  was proposed, which replaces the nonlinear units of the basic RNN. Among the numerous variants of LSTM, the gated recurrent unit (GRU) is one of the most popular modules owing to its ability to reduce the complexity of LSTM. There are two key components in the GRU, referred to as the update gate and the reset gate. The update gate, $z_t$, controls the balance between the previous and current feature information, while the reset gate, $r_t$, is used to modulate the states of the previous hidden feature. At time step $t$, the current gate outputs are computed as follows:

$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z),$

$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r),$

$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h),$

$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,$

where the $W$, $U$, and $b$ terms with a specific subscript represent the parameters of each gate, $\sigma$ is the sigmoid function, $\tilde{h}_t$ is the current memory content, $h_t$ is the current hidden state, and $\odot$ represents element-wise multiplication. In this work, we adopt the GRU as our basic RNN module.
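As a concrete reference, the GRU update described above can be sketched in a few lines of NumPy. This is a minimal illustration of the equations, not the authors' implementation; the parameter layout (a dict of per-gate weight triples) is our own choice:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step; params maps each gate name to a (W, U, b) triple,
    where W acts on the input, U on the previous hidden state."""
    W_z, U_z, b_z = params["z"]
    W_r, U_r, b_r = params["r"]
    W_h, U_h, b_h = params["h"]
    z = sigmoid(W_z @ x_t + U_z @ h_prev + b_z)              # update gate
    r = sigmoid(W_r @ x_t + U_r @ h_prev + b_r)              # reset gate
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r * h_prev) + b_h)  # candidate memory
    return (1.0 - z) * h_prev + z * h_tilde                  # new hidden state
```

Applying `gru_step` repeatedly over a feature sequence yields the recurrence used later in the regression SubNet.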
3 Proposed Method
3.1 Overall Network Architecture
Figure 2 illustrates the overall architecture of the proposed 3D hand pose estimation method. The proposed network mainly consists of two parts: an encoder that transforms an input depth image into an abstracted feature space, and joint regression sub-networks (SubNets) composed of six branches corresponding to the five fingers and the palm. The input depth image is first fed into the encoder for low-level feature extraction. Then, the regression SubNets take the feature map obtained from the encoder as input and predict the 3D pose of the palm and fingers.
Table 1: Encoder network architecture (layers, kernel size, number of channels, and output size).
3.2 Encoder Network
The encoder of the proposed 3D hand pose estimation method is based on ResNet , as described in Table 1. The encoder has five residual blocks, each of which consists of two convolutional layers. When the output dimension of a residual block increases, we use a convolutional layer instead of the identity skip connection. Max-pooling layers for down-sampling are appended after each residual block except for the last two. We design the encoder to be shallow because an input depth map has a plainer texture compared with the inputs used in classification or segmentation tasks. Unless otherwise noted, the output feature map of the encoder has 256 channels.
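A residual block with a projection shortcut of the kind described above can be sketched in NumPy. The naive `conv2d` and all parameter shapes here are illustrative, not the paper's exact encoder:

```python
import numpy as np

def conv2d(x, w):
    """Naive 'same'-padded 2D convolution. x: (H, W, C_in), w: (k, k, C_in, C_out)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H, W, _ = x.shape
    out = np.zeros((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

def residual_block(x, w1, w2, w_proj=None):
    """Two 3x3 convolutions with ReLU; a 1x1 projection convolution replaces
    the identity skip connection when the channel dimension changes."""
    y = np.maximum(0.0, conv2d(x, w1))
    y = conv2d(y, w2)
    skip = x if w_proj is None else conv2d(x, w_proj)
    return np.maximum(0.0, y + skip)
```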
3.3 Regression SubNet
As the different parts of the hand have different amounts of variation and degrees of freedom (DoF) due to the articulated structure of the hand , it is inefficient to directly regress all parts together from the encoded feature. Among the hand joints, the palm is much more stable than the fingers and has the highest DoF, which mainly affects the global hand position. The five fingers are largely independent and have more flexibility than the palm . By decomposing the global hand pose estimation problem into sub-tasks of palm and finger pose estimation, the optimization of the network parameters can be simplified by limiting the search space [8, 29]. Based on these properties of hands, we design our joint regression SubNet as a hierarchically structured network with six separate branches for the estimation of the palm and the five fingers: thumb, index, middle, ring, and little finger.
3.3.1 Sequential modeling of finger joints
Although recent state-of-the-art 3D hand pose estimation methods mostly adopt discriminative learning-based approaches as deep learning technology advances, model-based methods still have their own advantages [35, 36, 38]. Considering the joint connections within a finger, the movements of its different joints are highly related to each other. With the finger root, i.e., the metacarpophalangeal (MCP) joint, as the base position, each finger is composed of sequential joints, which can be represented as a kinematic chain. Using the kinematic structure of the hand, model-based methods can constrain the solution space of the hand joint positions. To take advantage of both the model-based and discriminative-learning-based methods, we estimate the $j$th joint of the $f$th finger, $\hat{J}_f^j$, by using the feature sequence for the previous links as follows:

$\hat{J}_f^j = \mathcal{F}(x_f^1, x_f^2, \dots, x_f^j),$ (8)
where $\mathcal{F}$ is a mapping function from the feature sequence onto the 3D coordinate and $x_f^j$, $j = 1, \dots, N_f$, is the abstracted feature of the $f$th finger's $j$th joint. Note that $x_f^1$ represents the feature of the MCP joint. Reflecting the kinematic structure of the finger, we design a regression sub-network (SubNet) of (8) by using a recurrent model in which the hidden layer containing the information of the previous links controls the current estimation as follows:

$h_f^j = \mathrm{GRU}(x_f^j, h_f^{j-1}; \theta),$ (9)

$\hat{J}_f^j = W h_f^j + b,$ (10)
where $h_f^j$ and $h_f^{j-1}$ are the current and previous hidden states, respectively, $\mathrm{GRU}(\cdot)$ denotes a GRU, $\theta$ is the parameter of the GRU, and $W$ and $b$ are the parameters of the linear regression.
3.3.2 Finger branch
Figure 3 depicts the RNN-based regression block in the finger branch. For a finger, each 3D joint position is recurrently estimated by using its input encoded feature and the previous joint information in the latent space. Inspired by the recent study in which sequential features of a sequence-like object are obtained from a single image , we first extract the sequential features of the joints from the output of the encoder network. To this end, we apply a convolution followed by global pooling to the feature map from the encoder for each branch in order to extract the hand-part-specific information. For example, let $v_f$ be such a feature vector of the $f$th finger branch. Taking $v_f$ as input, different FC layers are employed to obtain the feature vectors of each joint, $x_f^j$, $j = 1, \dots, N_f$, where $N_f$ is the number of joints in the finger, as shown in Figure 3. To estimate the first joint of the finger, i.e., the MCP joint, $x_f^1$ and a zero vector as the initial hidden state are fed into the GRU as in (9) and (10). Then, the hidden state in the GRU conveys the sequential information to the end joint along the kinematic chain. Finally, the estimated 3D coordinates of the joints $\hat{J}_f^j$ for $j = 1, \dots, N_f$ are concatenated to form the 3D pose of the $f$th finger.
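The per-finger regression described above can be sketched as follows. All parameter names are hypothetical: one FC layer per joint turns the pooled branch feature into the joint feature sequence, and a GRU chains the estimates from the MCP joint to the fingertip, with a shared linear layer regressing each 3D position from the hidden state:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """Standard GRU step with per-gate parameters Wz/Uz/bz, Wr/Ur/br, Wh/Uh/bh."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_tilde

def finger_branch(feat, joint_fcs, gru_params, W_out, b_out):
    """One finger branch: per-joint FC layers produce the feature sequence,
    the GRU walks the kinematic chain from MCP to fingertip, and a linear
    layer regresses each 3D joint position from the hidden state."""
    h = np.zeros(W_out.shape[1])                       # zero initial hidden state
    joints = []
    for W_fc, b_fc in joint_fcs:                       # one FC layer per joint
        x_j = np.maximum(0.0, W_fc @ feat + b_fc)      # ReLU joint feature
        h = gru_step(x_j, h, gru_params)               # propagate along the chain
        joints.append(W_out @ h + b_out)               # regress (x, y, z)
    return np.concatenate(joints)                      # 3D pose of the finger
```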
3.3.3 Palm branch
Like the finger branches, the hand-part-specific feature for the palm branch is first extracted. As mentioned at the beginning of this section, the palm is inflexible and more stable than the fingers. Thus, the palm joints are directly estimated by applying a series of FC layers to the palm feature.
3.4 Loss Functions
We train the proposed network in an end-to-end manner by minimizing the smooth $L_1$ loss  defined as

$L = \sum_{b} \sum_{j} \mathrm{smooth}_{L_1}\!\left(J_b^j - \hat{J}_b^j\right),$

$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2, & \text{if } |x| < 1, \\ |x| - 0.5, & \text{otherwise,} \end{cases}$

where $J_b^j$ and $\hat{J}_b^j$ are the ground-truth and estimated 3D coordinates of the $j$th joint from the $b$th branch. The smooth $L_1$ loss is proven to be more robust to outliers than the $L_2$ loss and is widely used in regression problems such as detection and classification tasks [14, 16].
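A NumPy sketch of the smooth $L_1$ loss applied elementwise to the coordinate residuals; summing over all joint coordinates of all branches is our assumed reduction:

```python
import numpy as np

def smooth_l1(x):
    """Elementwise smooth L1: 0.5*x^2 when |x| < 1, |x| - 0.5 otherwise."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def pose_loss(pred, gt):
    """Total loss: smooth L1 summed over all joint-coordinate residuals."""
    return float(smooth_l1(pred - gt).sum())
```

Note that the function is continuous at $|x| = 1$ (both pieces equal 0.5), with a bounded gradient for large residuals, which is what makes it robust to outliers.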
3.5 Implementation Details
As mentioned in the previous subsection, the input feature vectors of each branch are obtained by convolutional layers, and the number of channels of the feature vectors after these layers is set to 512. We add a batch normalization (BN) layer after each convolutional layer to simplify the learning procedure and improve the generalization ability. The rectified linear unit (ReLU) is employed as the activation function after the convolutional and FC layers, except for the last FC layer, which performs the final joint regression. Following the strategies in [3, 15, 24, 26], we first extract a fixed-size cube from the depth image around the hand. A hand region is cropped from this bounding area and resized to a fixed size. The depth values within the cropped region are normalized to [-1, 1], and points whose depth is outside the range of the cube are assigned a depth of 1. During training, we apply the following online data augmentation: random rotation in the range [-180, 180] degrees, random translation of [-10, 10] pixels, and random scaling of [0.9, 1.1]. We use the Adam optimizer 
with an initial learning rate of 1e-3, a batch size of 32, and a weight decay of 1e-5. The learning rate is decayed by a factor of 0.96 every 2k iterations. The entire network is trained for 120 epochs in an end-to-end manner. Our model is implemented in TensorFlow, and a single NVIDIA Titan X GPU (Pascal architecture) is used for training and testing. To design the branch details for the different hand pose datasets, we define the joint subsets as shown in Figure 4.
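The depth normalization step above can be sketched as follows. Treating invalid zero-depth pixels as background and pushing all out-of-cube points to the back plane (+1) are our assumptions about the exact convention:

```python
import numpy as np

def normalize_depth(crop, center_z, cube_half):
    """Map cropped depth values (mm) to [-1, 1] relative to the hand center.
    Pixels outside the cube, and invalid zero-depth pixels, are set to 1
    (i.e., assigned to the background plane)."""
    d = (crop - center_z) / cube_half
    d[(crop <= 0) | (np.abs(d) > 1.0)] = 1.0
    return d
```

For example, with a hand center at 500 mm and a cube half-extent of 150 mm, a pixel at 425 mm maps to -0.5, while a background pixel at 900 mm maps to 1.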
4 Experimental Results
4.1 Datasets and Evaluation Metrics
4.1.1 MSRA dataset
The MSRA dataset  contains 76k frames from nine different subjects with 17 different gestures. This dataset was captured with Intel's Creative Interactive Gesture Camera  and has 21 annotated joints, including one palm joint and four joints for each finger, as shown in Figure 4(a). Following the most commonly used protocol , we used a leave-one-subject-out cross-validation strategy for evaluation on this dataset.
4.1.2 NYU dataset
The NYU dataset  was captured with three Microsoft Kinects. It contains 72k training and 8k testing images from three different views. The training set was collected from one subject, while the testing set was collected from two subjects. According to the evaluation protocol that most previous works follow, we used only a frontal view and a subset of 14 annotated joints which is depicted in Figure 4(b) for both training and testing.
4.1.3 ICVL dataset
The ICVL dataset  was captured with an Intel RealSense camera. In this dataset, there are 22k frames from 10 different subjects for training and 1.5k images for testing. The training set includes an additional 300k augmented frames with in-plane rotations, but we did not use them because we applied online data augmentation during training, as described in Section 3.5. This dataset has 16 annotated joints, including one palm joint and three joints for each finger, as shown in Figure 4(c).
4.1.4 Evaluation metrics
To evaluate the performance of the different 3D hand pose estimation methods, we used two metrics. The first is the average 3D distance error between the ground-truth and predicted 3D positions of each joint. The second is the percentage of successful frames, i.e., frames in which the errors of all joints are within a given threshold.
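These two metrics can be sketched in NumPy as follows; the `(n_frames, n_joints, 3)` array layout is our convention:

```python
import numpy as np

def mean_joint_error(pred, gt):
    """Average 3D Euclidean distance (mm) over all joints and all frames.
    pred, gt: arrays of shape (n_frames, n_joints, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def success_rate(pred, gt, threshold):
    """Fraction of frames whose worst (maximum) joint error is within threshold."""
    worst = np.linalg.norm(pred - gt, axis=-1).max(axis=1)
    return (worst <= threshold).mean()
```

Because the success metric is gated by the worst joint in a frame, it penalizes single-joint failures much more harshly than the mean error does.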
Table 2: Comparison with state-of-the-art methods in terms of mean 3D distance error (mm) on the ICVL, NYU, and MSRA datasets.

| Method | ICVL | NYU | MSRA | Input |
|---|---|---|---|---|
| Multi-view CNNs  | – | – | 13.1 | 2D |
| REN (4x6x6)  | 7.63 | 13.39 | – | 2D |
| REN (9x6x6)  | 7.31 | 12.69 | 9.79 | 2D |
| 3D CNN  | – | 14.1 | 9.58 | 3D |
| 3D DenseNet  | 6.7 | 10.6 | 7.9 | 3D |
| Hand PointNet  | 6.94 | 10.5 | 8.5 | 3D |
4.2 Ablation Study

To analyze the contributions of each component of the proposed method, we conducted self-comparative experiments on the ICVL  dataset. First, we evaluated the effect of independent finger joint estimation. For this experiment, as shown in Figure 5(a), we designed a two-branch network consisting of one branch for palm joint regression and another for unified finger joint regression, which is similar to the approach in . In other words, in the finger branch of the two-branch network, the RNN-based regression block recurrently estimates the joints of the five fingers simultaneously. For a fair comparison with the proposed architecture, we adjusted the number of convolution channels and the dimensions of the FC layers in the finger branch of the two-branch network so that each network has a similar number of parameters. As shown in Figure 6, the proposed network architecture performs better and reduces the mean 3D distance error by 0.41 mm (from 6.99 to 6.58; see the last entry of the graph) compared with the baseline architecture. This result demonstrates that regressing all fingers jointly is inefficient because the five fingers are largely independent . By building a fine-grained branch for each finger, the network can learn richer features for finger pose estimation.
We also evaluated the effect of the RNN-based regression on finger joint estimation. We designed another network architecture in which the RNN-based regression block is replaced with two FC layers, as shown in Figure 5(b). The FC-layer-based network directly regresses all joints of the finger from the input finger features rather than sequentially estimating the finger joints along the kinematic chain. As shown in Figure 6, the proposed RNN-based network architecture achieves a better result than the FC-layer-based network with direct regression and reduces the mean 3D distance error by 0.37 mm (from 6.95 to 6.58). These experiments confirm that the proposed RNN-based sequential regression can effectively estimate the spatially related finger joints along the kinematic chain.
Table 3: Comparison of inference speed on a single GPU.

| Method | Test speed (fps) | Input |
|---|---|---|
| 3D DenseNet  | 126 | 3D |
| 3D CNN  | 215 | 3D |
4.3 Comparison with State-of-the-art Methods
We compared the proposed network on three public 3D hand pose datasets with the most recently proposed methods using 2D depth maps as input, including DISCO , DeepPrior , its improved version DeepPrior++ , Feedback , Multi-view CNNs , REN-4x6x6 , REN-9x6x6 , Pose-REN , Generalized , Global2Local , CrossingNets , HBE , and CrossInfoNet , as well as methods using 3D inputs, including 3D CNN , SHPR-Net , 3D DenseNet , HandPointNet , Point-to-Point , and V2V-PoseNet . The average 3D distance error per joint and the percentage of successful frames over different error thresholds are shown in Table 2 and Figure 7, respectively. As can be seen, our method outperforms state-of-the-art methods with 2D inputs on all three datasets. Compared with the methods using 3D inputs, our method performs better than 3D CNN , SHPR-Net , HandPointNet , and 3D DenseNet  and achieves performance comparable to Point-to-Point  on the ICVL and MSRA datasets. On the NYU dataset, the results of the proposed method are worse than those of V2V-PoseNet  but are better in terms of the percentage of successful frames when the error threshold is larger than 30 mm. On the MSRA dataset, following the evaluation protocol of prior works [3, 11, 30], we also measured the mean joint error over various viewpoint angles. As shown in Figure 8, our method obtains promising results under large yaw and pitch angles, which demonstrates the robustness of the proposed method to viewpoint changes, a challenging problem in hand pose estimation. The qualitative results of our method on the three datasets are shown in Figure 9. It can be seen that our method accurately estimates 3D hand joint locations on all three datasets.
The inference speed is also an important factor for the practical application of 3D hand pose estimation. Table 3 compares the inference speed of conventional methods and the proposed method on a single GPU. While the top-ranked methods using 3D inputs have higher inference times owing to time-consuming 3D convolution operations or data-conversion procedures, our method has a faster inference speed owing to its efficient 2D CNN-based architecture. The proposed HCRNN ranks second among the compared methods, behind Feedback. Based on the aforementioned results, the proposed HCRNN not only achieves competitive performance compared with state-of-the-art methods but is also very efficient, with a high frame rate, which shows its applicability to real-time applications.
5 Conclusion

To design a practical architecture for 3D hand pose estimation, we considered the articulated structure of the hand and proposed an efficient regression network termed HCRNN. The proposed HCRNN has a hierarchical architecture in which six separate branches are trained to estimate the position of each local part of the hand: the palm and the five fingers. In each finger branch, we adopted an RNN to model the spatial dependencies between the sequential joints of the finger. In addition, HCRNN is built on a 2D CNN that directly takes 2D depth images as inputs, making it more efficient than 3D CNN-based methods. The experimental results showed that the proposed HCRNN not only achieves competitive performance compared with state-of-the-art methods but also runs at a highly efficient speed of 240 fps on a single GPU.
-  (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §3.5.
-  (2016) DISCO Nets: dissimilarity coefficients networks. In Advances in Neural Information Processing Systems, pp. 352–360. Cited by: §4.3, Table 2.
-  (2019) Pose guided structured region ensemble network for cascaded hand pose estimation. Neurocomputing. Cited by: §3.5, §4.3, Table 2.
-  (2018) SHPR-Net: deep semantic hand pose regression from point clouds. IEEE Access 6, pp. 43425–43439. Cited by: §1, §4.3, Table 2.
-  (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §2.
-  (2019) CrossInfoNet: multi-task information sharing based hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9896–9905. Cited by: §1, §4.2, §4.3, Table 2, Table 3.
-  (2007) Vision-based hand pose estimation: a review. Computer Vision and Image Understanding 108 (1-2), pp. 52–73. Cited by: §1.
-  (2008) Progressive search space reduction for human pose estimation. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §3.3.
-  (2018) Hand PointNet: 3D hand pose estimation using point sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8417–8426. Cited by: §1, §4.3, Table 2, Table 3.
-  (2016) Robust 3d hand pose estimation in single depth images: from single-view cnn to multi-view cnns. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3593–3601. Cited by: §4.3, Table 2.
-  (2017) 3d convolutional neural networks for efficient and robust hand pose estimation from single depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1991–2000. Cited by: §1, §4.3, Table 2, Table 3.
-  (2018) Real-time 3d hand pose estimation with 3d convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence 41 (4), pp. 956–970. Cited by: §1, §4.3, Table 2, Table 3.
-  (2018) Point-to-point regression pointnet for 3d hand pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 475–491. Cited by: §1, §1, §4.3, Table 2, Table 3.
-  (2015) Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448. Cited by: §3.4.
-  (2017) Region ensemble network: improving convolutional network for hand pose estimation. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 4512–4516. Cited by: §3.5, §4.3, Table 2.
-  (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence 37 (9), pp. 1904–1916. Cited by: §3.4.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2.
-  (2001) Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A field guide to dynamical recurrent neural networks. IEEE Press. Cited by: §2.
-  (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.5.
-  (2017) End-to-end global to local cnn learning for hand pose recovery in depth data. arXiv preprint arXiv:1705.09606. Cited by: §1, §4.3, Table 2.
-  (2013) Dynamics based 3d skeletal hand tracking. In Proceedings of Graphics Interface 2013, pp. 63–70. Cited by: §4.1.1.
-  (2018) V2V-PoseNet: voxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5079–5088. Cited by: §1, §1, §4.3, Table 2, Table 3.
-  (2017) DeepPrior++: improving fast and accurate 3d hand pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 585–594. Cited by: §3.5, §4.3, Table 2, Table 3.
-  (2015) Hands deep in deep learning for hand pose estimation. arXiv preprint arXiv:1502.06807. Cited by: §4.3, Table 2.
-  (2015) Training a feedback loop for hand pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3316–3324. Cited by: §1, §3.5, §4.3, Table 2, Table 3.
-  (2019) Generalized feedback loop for joint hand-object pose estimation. IEEE transactions on pattern analysis and machine intelligence. Cited by: §4.3, Table 2, Table 3.
-  (2016) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence 39 (11), pp. 2298–2304. Cited by: §1, §3.3.2.
-  (2016) . In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4150–4158. Cited by: §3.3.
-  (2015) Cascaded hand pose regression. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 824–832. Cited by: 2nd item, §1, §3.3, §4.1.1, §4.1, §4.2, §4.3.
-  (2014) Latent regression forest: structured estimation of 3d articulated hand posture. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3786–3793. Cited by: 2nd item, §4.1.3, §4.1, §4.2.
-  (2014) Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics (ToG) 33 (5), pp. 169. Cited by: 2nd item, §4.1.2, §4.1.
-  (2017) Crossing nets: combining gans and vaes with a shared latent space for hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 680–689. Cited by: §4.3, Table 2, Table 3.
-  (2018) Region ensemble network: towards good practices for deep 3d hand pose estimation. Journal of Visual Communication and Image Representation 55, pp. 404–414. Cited by: §4.3, Table 2.
-  (2018) Model-based hand pose estimation for generalized hand shape with appearance normalization. arXiv preprint arXiv:1807.00898. Cited by: §3.3.1.
-  (2001) Capturing natural hand articulation. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, Vol. 2, pp. 426–432. Cited by: §3.3.1.
-  (2018) Depth-based 3d hand pose estimation: from current achievements to future goals. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2636–2645. Cited by: §1.
-  (2016) Model-based deep hand pose estimation. arXiv preprint arXiv:1606.06854. Cited by: §3.3.1.
-  (2018) HBE: hand branch ensemble network for real-time 3d hand pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 501–516. Cited by: §1, §4.3, Table 2.