
End-to-end Global to Local CNN Learning for Hand Pose Recovery in Depth data

by Meysam Madadi et al.

Despite recent advances in 3D pose estimation of human hands, especially thanks to the advent of CNNs and depth cameras, this task is still far from being solved. This is mainly due to the highly non-linear dynamics of fingers, which makes hand model training a challenging task. In this paper, we exploit a novel hierarchical tree-like structured CNN, in which branches are trained to become specialized in predefined subsets of hand joints, called local poses. We further fuse local pose features, extracted from hierarchical CNN branches, to learn higher order dependencies among joints in the final pose by end-to-end training. Lastly, the loss function used is also defined to incorporate appearance and physical constraints about doable hand motion and deformation. Experimental results suggest that feeding a tree-shaped CNN, specialized in local poses, into a fusion network for modeling joints correlations, helps to increase the precision of final estimations, outperforming state-of-the-art results on NYU and MSRA datasets.



1 Introduction

Recently, hand pose recovery has attracted special attention thanks to the availability of low cost depth cameras, like Microsoft Kinect [4, 11, 15, 23, 21, 29, 27, 12, 30, 28]. Unsurprisingly, 3D hand pose estimation plays an important role in most HCI application scenarios, like social robotics and virtual immersive environments [24].

Despite impressive pose estimation improvements thanks to the use of CNNs and depth cameras, 3D hand pose recovery still faces several challenges before becoming fully operational in uncontrolled environments with fast hand and finger motion, self-occlusions, noise, and low resolution [35]. Although CNNs and depth cameras have made it possible to model highly non-linear hand motion and finger deformation under extreme variations in appearance and viewpoint, accurate 3D hand pose recovery is still an open problem.

Two main strategies have been proposed in the literature for addressing the aforementioned challenges: model-based and data-driven approaches. Model-based generative approaches fit a predefined 3D hand model to the depth image [29, 23, 25, 17, 22, 5]. However, as a many-to-one problem, accurate initialization is critical; besides, the use of global objective functions may not yield accurate results in the case of self-occluded fingers.

Alternatively, the so-called data-driven approaches use the available training data to learn hand pose directly from appearance. Data-driven approaches for hand pose estimation have benefited from recent advances in convolutional neural networks (CNNs) [39, 11, 28, 30, 12, 13]. As in many other computer vision tasks, CNNs have been successfully applied in data-driven hand pose recovery, either for heat-map regression of discrete outputs (corresponding to joint estimation probabilities) or for direct regression of continuous outputs (corresponding to joint locations) [20, 21, 33, 15, 26]. Heat-map regression models require additional optimization time to compute the likelihood of a joint being located at a particular spatial region. Unfortunately, heat-map based methods tend to propagate errors when mapping images to the final joint space. A main issue with CNNs as direct regression models, on the other hand, is how to deal with highly nonlinear output spaces, since overly complex models jeopardize generalization. Indeed, for CNNs, learning suitable features (i.e., with good generalization and discrimination properties) in highly nonlinear spaces, while taking into account the structure and dependencies among parameters, is still a challenging task.

Figure 1: Qualitative comparison of our proposed approach vs state-of-the-art methods.

In this paper, direct regression of the 3D hand pose is implemented as a specific tree-shaped CNN architecture designed to avoid training a coarse, global hand motion model, instead allowing finer local specializations for different fingers and hand regions. We thus break the hand pose estimation problem into hierarchical optimization subtasks, each one focused on a specific finger and hand region. Combined together in a tree-like structure, the final CNN shows fast convergence rates due to computations applied at a local level. In addition, we model correlated motion among fingers by fusing the features learned in the hierarchy through fully connected layers and training the whole network in an end-to-end fashion. The main advantage of this strategy is that the 3D hand pose prediction problem is addressed as a global learning task built on local estimations.

Moreover, it has been shown that the L2 loss, in regression problems, is sensitive to outliers and ground-truth noise [2]. Therefore, in order to further improve the final estimation in the highly nonlinear space of hand configurations, we incorporate appearance and physical penalties in the loss function, based on the physical constraints typically applied in 3D reconstruction of human poses [1]. By including such penalties during the network learning stage, unrealistic pose configurations are avoided.

Lastly, as is common in deep learning problems, the variability and amount of data define the success of a model, and CNN models often fail to generalize well to unseen data. In this paper we introduce a non-rigid augmentation approach to generate realistic data from the training data. To the best of our knowledge, this is the first time such augmentation is applied to depth images. We use the ground-truth joints to compute hand kinematic parameters and deform the hand joints; we then apply interpolation techniques to deform the point cloud based on the deformed joints. Results demonstrate that our proposed framework trained on augmented data outperforms state-of-the-art data-driven approaches on the NYU and MSRA datasets.

We qualitatively compare state-of-the-art pose estimation approaches with ours in Fig. 1. The work of Tompson et al. [33] estimates 2D pose using joint heat-maps only, thus providing poor pose estimates for noisy input images (second column). The results of Oberweger et al. [20] (DeepPrior) show that PCA is not able to properly model hand pose configurations. Oberweger et al. [21] improved these results by applying an error feedback loop. However, error feedback does not provide accurate pose recovery across all the variability of hand poses. In essence, in our proposed local-based pose estimation framework, a separate network is trained for each finger. Subsequently, we fuse the learned local features to include higher order dependencies among joints, thus obtaining better pose estimates than previous approaches.

2 Related Work

Hand pose estimation has been extensively studied in the literature [7]; we refer the reader to [25] for a complete classification of state-of-the-art works in the field. Here we focus mostly on recent works using CNNs and depth cameras.

Most CNN-based architectures in data-driven hand pose estimation approaches are specifically designed to be discriminative and generalizable. Although the success of such approaches depends on the availability and variability of training data, CNN models cope reasonably well with this problem, and two main families of approaches can be distinguished in the literature, namely heat-map and direct regression methods.

Heat-map approaches estimate per-pixel likelihoods for each joint as a pre-processing step. In [33], a CNN is fed with multi-resolution input images and one heat-map per joint is generated. Subsequently, an inverse kinematics model is applied on these heat-maps to recover the hand pose. Nevertheless, this approach is prone to propagating errors when mapping back to the original image, and the estimated joints may not respect the physical constraints of the hand. The work of [15] extends this strategy by applying multi-view fusion of the extracted heat-maps, where 3D joints are recovered from three different viewpoints. In this approach, erroneous heat-maps are expected to be improved in the fusion step using complementary viewpoints. The key idea of this work is to reduce the complexity of the input data by aligning all data with respect to the eigenvectors of the hand point cloud. For most heat-map based approaches, however, an end-to-end solution can only be achieved by considerably increasing the complexity of the model, e.g. by introducing a cascading approach [38]. Although such approaches work well for 2D pose estimation in RGB images, they are not necessarily able to model occluded joints of complex hand poses in depth data.

As an alternative, a number of works propose direct regression of the 3D joint positions based on image features [21, 20, 28]. As mentioned in [32], contrary to heat-map based methods, hand pose regression can better handle the increased complexity of modeling highly nonlinear spaces. Although some approaches apply Principal Component Analysis (PCA) to reduce the pose space [15, 20], such linear methods typically fail when dealing with the large pose and appearance variability produced by different viewpoints (as shown in Fig. 1).

Recently, error feedback [21] and cascading [28] approaches have been shown to avoid local minima through iterative error reduction. The authors of [21] propose to train a generative network on depth images, iteratively improving an initial guess. In this sense, Neverova et al. [18] use hand segmentation as an intermediate representation to enrich pose estimation with iterative multi-task learning. The method proposed in [28] divides the global hand pose problem into local estimations of the palm pose and the finger poses; finger locations can then be updated at each iteration relative to the palm. Contrary to our method, the authors use a cascade of classifiers to combine such local estimations.

The authors of [26] apply a CNN and use the resulting feature maps as descriptors for computing the k-nearest shapes. Similarly to our approach, their CNN separates palm and fingers and computes the final descriptor by dimensionality reduction. Unlike our approach, they factorize the feature vectors and nearest-neighbor hyper-parameters to estimate the hand pose. In contrast, we propose training the network by fusing local features to avoid inaccurate local solutions, without the need for cascading strategies or multi-view set-ups. Contrary to methods that simplify the problem by dividing the output space into subspaces, Guo et al. [10] divided the input image into smaller overlapping regions and fused the CNN feature maps as a region ensemble network.

In CNN-based methods, data augmentation is a common approach to help the network generalize better. Recently, Ge et al. [9] applied data augmentation to the problem of hand pose recovery and showed a meaningful improvement in the results. Oberweger et al. [19] extended the DeepPrior model of [20] and showed the effectiveness of a simple model trained with data augmentation. However, the aforementioned approaches use simple, rigid data augmentation like scaling, rotation and translation, which may not represent the visual variability of 3D articulated joints. Here, we propose a non-rigid data augmentation by deforming the hand parameters and interpolating the point cloud.

3 Global hand pose recovery from local estimations

Given an input depth image, we refer to the 3D locations of the hand joints as a joint set. We denote a joint either in the world coordinate system or after projecting it onto the image plane. The set consists of the wrist, the finger joints and the finger tips, following the hand model defined in [28]. We assume the hand is initially visible in the depth image, i.e. not occluded by other objects in the scene (although it may present self-occlusions), and has been properly detected beforehand (i.e. pixels belonging to the hand are already segmented [33]). We also assume the intrinsic camera parameters are available. We refer to the whole joint set as the global pose, while a local pose is a subset of it (e.g. the index finger joints).

Considering hand pose recovery as a regression problem with the estimated pose as output, we propose a CNN-based tree-shaped architecture, starting from the whole hand depth image and subsequently branching into CNN blocks, one per local pose. We show the main components of the proposed approach in Fig. 2. In this design, each network branch is specialized in one local pose, and related local poses share features in the earlier network layers. Indeed, we break the global pose into a number of overlapping local poses and solve these simpler problems, reducing the nonlinearity of the global pose. However, since local solutions can easily be trapped in local minima, we incorporate higher order dependencies among all joints by fusing the last convolutional layer features of each branch and training the network for the global and local poses jointly. We cover this idea in Sec. 3.1. We also apply constraints based on the appearance and dynamics of the hand as a new, effective loss function which is more robust against overfitting than a plain L2 loss while providing better generalization. This is explained in Sec. 3.2.

3.1 Hand pose estimation architecture

In CNNs, generally, each filter extracts a feature from the previous layer, and by increasing the number of layers a network is able to encode different inputs with a growing Field of View (FoV). During training, features are learned to be activated through a nonlinear function, for instance Rectified Linear Units (ReLU). The complexity and amount of training data has a direct relation to the number of filters, the number of layers, and the complexity of the architecture: an enormous number of filters or layers might cause overfitting, while a low number might lead to slow convergence and poor recognition rates. Different architectures have been proposed to cope with these issues [38, 21, 32]. For example, in multi-task learning, different branching strategies are typically applied to solve subproblems [6, 8], and the different subproblems are solved jointly by sharing features. Similarly, we divide the global hand pose into simpler local poses (i.e. palm and fingers) and solve each local pose separately in a branch by means of a tree-shaped network. We show this architecture in Fig. 2.

Figure 2: Proposed network architecture. The branching strategy connects CNN blocks into a tree-shaped structure while regressing a local pose at each branch. Each local pose is a 24 dimensional vector. We also include a viewpoint regressor in the network, producing a rotation in terms of quaternions at the output. We then fuse all the features of the last convolutional layers to estimate the output global pose. We use these features in the fusion to extract the palm joints more accurately.

The proposed architecture has several advantages. Firstly, the most correlated fingers share features in earlier layers. By doing this, we allow the network to hierarchically learn more specific features for each finger with respect to its most correlated fingers. Secondly, the number of filters per finger can be adaptively determined. Thirdly, the estimation of the global pose is reduced to the estimation of simpler local poses, allowing the network to train with fast convergence.

We define the amount of locality by the number of joints contributing to a local pose. Keeping the locality high (i.e. a lower number of joints), on the one hand, causes fingers to be easily confused with each other, or detected in physically impossible locations. A low locality value (i.e. a higher number of joints), on the other hand, increases the complexity. Besides, local joints should share a similar motion pattern to keep the complexity low. So, in the particular implementation in this paper, we assign to each local pose one finger plus the palm joints, leading to a 24 dimensional vector.
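As an illustration, the grouping above can be sketched as follows. The 21-joint indexing and the choice of four palm joints are hypothetical; the text only states that a local pose is one finger plus the palm joints, flattened into a 24-dimensional vector (8 joints x 3 coordinates), and that local poses may overlap.

```python
# Hypothetical 21-joint indexing: wrist = 0, then 4 joints per finger.
PALM = [0, 1, 5, 9]                    # wrist + assumed palm joints (overlap is allowed)
FINGERS = {
    "thumb":  [1, 2, 3, 4],
    "index":  [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring":   [13, 14, 15, 16],
    "pinky":  [17, 18, 19, 20],
}

def local_pose(full_pose, finger):
    """Flatten the (x, y, z) coordinates of one finger's joints plus the
    palm joints into the 24-D regression target of that branch."""
    vec = []
    for j in FINGERS[finger] + PALM:
        vec.extend(full_pose[j])
    return vec
```

Each branch thus regresses 8 joints (4 finger joints + 4 palm joints), giving the 24 outputs stated in the text.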

Training the network only on local poses omits information about inter-finger relations. Tompson et al. [34] included a graphical model within the training process to formulate joint relationships. Li et al. [14] used a dot product to compute similarities between the embedded spaces of a given pose and an estimated one in a structural learning strategy. Instead, we apply late fusion of the local features, letting the network learn the joint dependencies through fully connected layers to estimate the final global pose. The whole network is trained end-to-end jointly for all global and local poses given a constrained loss function.

Network details. Input images are pre-processed with a fixed-size cube centered on the hand point cloud and projected onto the image plane. Subsequently, the resulting window is cropped and resized to a fixed-size image using nearest neighbor interpolation, with zero-mean depth.
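The pre-processing step could be sketched as below; the output resolution (`out_size=96`) is an assumption, as the text does not state it.

```python
def preprocess(depth, bbox, out_size=96):
    """Crop the hand window from a depth image (list of rows), resize it
    with nearest-neighbour interpolation, and zero-mean the depth values.
    bbox = (x0, y0, x1, y1) is the projected crop window."""
    x0, y0, x1, y1 = bbox
    crop = [row[x0:x1] for row in depth[y0:y1]]
    h, w = len(crop), len(crop[0])
    # Nearest-neighbour resize to a fixed out_size x out_size patch.
    resized = [[crop[int(r * h / out_size)][int(c * w / out_size)]
                for c in range(out_size)] for r in range(out_size)]
    # Zero-mean the depth values of the patch.
    vals = [v for row in resized for v in row]
    mean = sum(vals) / len(vals)
    return [[v - mean for v in row] for row in resized]
```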

As intermediate layers, the network is composed of six branches, where each branch is associated to specific fingers as follows: two branches for the index and middle fingers, two branches for the ring and pinky fingers, one branch for the thumb, and one branch for the palm. For the palm branch, instead of performing direct regression on the palm joints, we regress the palm viewpoint, defined as the rotation (in terms of quaternions) between the global reference view and the palm view. As shown in the experimental results, a more accurate and reliable optimization is then achieved, since the network is able to model interpolations among different views.

As shown in Fig. 2, each convolutional block consists of a convolution layer (with the filter kernel sizes given in Fig. 2) and a ReLU followed by a max-pooling layer, except for the last block. All pooling layers use the same window size. The last block contains a convolutional layer providing a feature vector. Fully connected layers are added to the end of each branch for both local and global pose learning. For the local pose at each branch there are two hidden layers with 1024 neurons and a dropout layer in between. Similarly, for the global pose at each branch, the feature vector is followed by two hidden layers with 1024 neurons and a dropout layer in between. Then, the last hidden layers are concatenated and followed by a dropout layer and a hidden layer with 1024 neurons. Finally, the global and local output layers provide the estimation of joints with one neuron per joint and dimension.

3.2 Constraints as loss function

In regression problems, the goal is to optimize parameters such that a loss function between the estimated values of the network and the ground-truth values is minimized. Usually, an L2 loss function plus a regularization term is optimized during training. However, it is generally known that, in an unbalanced dataset containing outliers, L2 norm minimization can result in poor generalization and sensitivity to outliers, since equal weights are given to all training samples [2]. Weight regularization is commonly used in deep learning as a way to avoid overfitting; however, it does not guarantee that the weight updates bypass local minima. Besides, a high weight decay causes low convergence rates. Belagiannis et al. [2] proposed Tukey's biweight loss function for regression problems as an alternative to the L2 loss that is robust against outliers. We instead formulate the loss function as an L2 loss along with constraints applied to the hand joints regarding hand dynamics and appearance, leading to more accurate results and less sensitivity to ground-truth noise. We define the loss function for one frame in the form of:


where the weighting factors balance the individual loss terms, and the four terms denote the loss for the estimated local and global pose, the appearance, and the hand dynamics, respectively. Next, each component is explained in detail.

Let the estimated pose matrix be the concatenation of the joints estimated in each branch of the proposed network, and let the corresponding ground-truth matrix be defined analogously. Note that the global estimate is not necessarily equal to the concatenation of the local estimates. The embedded network produces outputs for both the estimated joints and the ground truth. Then, we define the local and global losses as:


A common problem in CNN-based methods for pose estimation is that in some situations estimated pose does not properly fit with appearance. For instance, joints are placed in locations where there is no evidence of presence of hand points, or being physically incorrect [21, 20, 15]. In this paper, during training we penalize those joint estimations that do not fit with the appearance or are physically not possible, and include such penalties in the loss function.

We first assume that joints must lie inside the hand area and have a depth value greater than the hand surface; besides, joints must present physically possible angles in the kinematic tree. Therefore, for a given joint, the depth value of the image at its projected location must not exceed the joint's depth. To avoid violating the first condition (i.e. when a joint is located outside the hand area after projection to the image plane), we set the background with a cone function as:

where w and h are the width and height of the image, and the offset is a fixed value set to 100. The reason to use a cone function instead of a fixed large value is to avoid zero derivatives on the background. We use a hinge formulation to convert the inequality into a loss through:


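A minimal sketch of the appearance penalty follows. The cone is assumed to rise linearly with distance from the image centre plus the fixed offset of 100; the exact formula is not recoverable from this text, so treat both choices as assumptions.

```python
import math

def cone_background(u, v, w, h, offset=100.0):
    """Hypothetical cone for background pixels: a fixed offset plus the
    distance to the image centre, so gradients never vanish there."""
    return offset + math.hypot(u - w / 2.0, v - h / 2.0)

def appearance_loss(depth_at, joints_2d, joints_z):
    """Hinge penalty: a joint whose depth z lies in front of the observed
    surface D(u, v) contributes max(0, D(u, v) - z)."""
    return sum(max(0.0, depth_at(u, v) - z)
               for (u, v), z in zip(joints_2d, joints_z))
```

A joint projected onto the background hits the cone, so the hinge pushes it back toward the hand region with a non-zero gradient.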
We subsequently incorporate hand dynamics by means of the top-down strategy described in Algorithm 1. We assume all joints belonging to each finger (except the thumb) should be collinear or coplanar. The thumb has an extra non-coplanar form, so we do not consider it in the hand dynamics loss. A ground-truth finger state is assigned to each finger, computed by the conditions defined in Algorithm 1. Each finger has a ground-truth normal vector, which is the finger direction in case 1 and the finger plane normal vector in the other cases. Therefore, we define four different losses, one of them triggered for each finger (as shown in Algorithm 1). Let the four joints belonging to a finger be ordered starting from the root joint and ending at the fingertip. Then the dynamics loss is defined as:


where the subscript denotes the finger index. We now consider each case of Algorithm 1 in turn.

We consider a collinear finger in case 1. A finger is collinear if:

where the threshold defines the amount of collinearity. To compute the loss for a collinear ground-truth finger, the cosine between the direction of the first bone and each subsequent bone must stay above a threshold; this condition has to hold for the remaining joint pairs as well. The cosine can be computed through the dot product. Therefore, using the hinge formulation, the loss is defined as:


where the weighting factor balances the different components of the loss function.
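The collinearity test and its hinge loss can be sketched as follows; the thresholds `eps` and `margin` are placeholders, since the actual values are stripped from this text.

```python
import math

def _unit(a, b):
    """Unit direction of the bone from joint a to joint b."""
    d = [bi - ai for ai, bi in zip(a, b)]
    n = math.sqrt(sum(x * x for x in d))
    return [x / n for x in d]

def is_collinear(j1, j2, j3, j4, eps=0.02):
    """A finger is treated as collinear if every bone direction agrees
    with the root bone direction up to a cosine tolerance eps."""
    segs = [_unit(a, b) for a, b in ((j1, j2), (j2, j3), (j3, j4))]
    return all(sum(x * y for x, y in zip(segs[0], s)) > 1.0 - eps
               for s in segs[1:])

def collinear_loss(j1, j2, j3, j4, margin=0.95):
    """Hinge on the cosine between the root bone and each later bone."""
    segs = [_unit(a, b) for a, b in ((j1, j2), (j2, j3), (j3, j4))]
    return sum(max(0.0, margin - sum(x * y for x, y in zip(segs[0], s)))
               for s in segs[1:])
```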

We consider a coplanar finger for cases 1, 2 and 3. We define a finger to be coplanar if the cross products of all three-member subsets of the finger joints are parallel. Note that a collinear finger is necessarily coplanar; however, we exclude collinear fingers from this definition because their cross products degenerate to near-zero vectors, as shown in Algorithm 1. For a ground-truth coplanar finger, such cross products must be parallel to the plane normal vector. Therefore, for three given joints, the following condition must hold:

Given that the ground-truth finger is coplanar for a given case, we compute the loss function as:


The loss functions for the other coplanar finger cases are computed in the same way.
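A sketch of the coplanarity test: the condition over all three-joint subsets is reduced here to consecutive bone vectors for brevity, checking that their cross products are mutually parallel (cross product of normals near zero). Note that a collinear finger, whose bone cross products vanish, passes trivially, consistent with the remark above.

```python
def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def _sub(a, b):
    return [ai - bi for ai, bi in zip(a, b)]

def is_coplanar(joints, eps=1e-6):
    """Joints are coplanar if the cross products (plane normals) of
    consecutive bone vectors are all parallel to the first one."""
    bones = [_sub(b, a) for a, b in zip(joints, joints[1:])]
    normals = [_cross(u, v) for u, v in zip(bones, bones[1:])]
    ref = normals[0]
    return all(sum(c * c for c in _cross(ref, n)) < eps
               for n in normals[1:])
```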

3.3 Loss function derivatives

All components in Eq. 1 are differentiable, thus we are able to use gradient-based optimization methods. In this section we explain the derivatives of the constraint loss function in Eq. 4; the derivatives of the remaining loss functions are computed through standard matrix calculus. We first define the derivative of the appearance loss with respect to an estimated joint through:


In the following we only consider the positive condition of Eq. 8. Besides, we omit the joint index from the notation for ease of reading. The depth image is a discrete multi-variable function of the pixel coordinates u and v, where u is a function of x and z, and v is a function of y and z. Consequently, the total derivative of the depth image can be computed by the chain rule through:


Next, we present the components of the derivative in detail (derivatives with respect to v are computed in the same way as for u). The depth image is a function of the hand surface. However, the hand surface given by the depth camera may be noisy and not differentiable at some points. To cope with this problem, we estimate the depth image derivatives by applying hand surface normal vectors. Let n be the surface normal vector at a given joint. Then, the derivative of the depth image with respect to each image axis is given by the tangent vectors through:


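One way to realize this normal-based derivative: for a surface z = D(u, v) with unit normal n = (nx, ny, nz), the tangent along u is (1, 0, dD/du), and orthogonality to n gives dD/du = -nx/nz (similarly for v). This is a standard identity, shown here as a sketch rather than the paper's exact formulation.

```python
def depth_gradient(normal, eps=1e-6):
    """Estimate (dD/du, dD/dv) of the depth surface from its unit
    normal n = (nx, ny, nz); nz is clamped to avoid division by zero
    for near-grazing surfaces."""
    nx, ny, nz = normal
    nz = nz if abs(nz) > eps else eps
    return (-nx / nz, -ny / nz)
```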
As mentioned, the estimated joint is projected from world coordinates to the image plane. Note that the joints have zero mean and the projection is computed after the image has been cropped and resized. Let the camera focal length and image center for each axis, the world-coordinate hand point cloud center, and its projection onto the image plane be given. Then the projected coordinate is computed as:


where the remaining parameter is the cube size used around the hand point cloud to crop the hand image. Using this formulation, the derivative of the projected coordinate can be easily computed and replaced in Eq. 10.
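Under standard pinhole conventions, the projection chain (zero-mean world joint to cropped-image coordinate) could look as follows. The sign and axis conventions, and the way the crop window is derived from the cube size, are assumptions, not taken from the stripped equation.

```python
def project(joint, center, f, c, cube, out_size):
    """Map a zero-mean world joint to cropped-image pixels: add back the
    point-cloud centre, apply the pinhole model u = f*x/z + c, then map
    the cube-sized crop window to an out_size image."""
    x, y, z = (j + cc for j, cc in zip(joint, center))
    u = f[0] * x / z + c[0]
    v = f[1] * y / z + c[1]
    # Projection of the point-cloud centre = centre of the crop window.
    uc = f[0] * center[0] / center[2] + c[0]
    vc = f[1] * center[1] / center[2] + c[1]
    half = (cube / 2.0) * f[0] / center[2]   # half crop window in pixels
    su = (u - (uc - half)) * out_size / (2.0 * half)
    sv = (v - (vc - half)) * out_size / (2.0 * half)
    return su, sv
```

A joint at the point-cloud centre lands at the centre of the cropped image, as expected.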

4 Experiments

In this section we evaluate our approach on two real-world datasets, NYU [33] and MSRA [28], and one synthetic dataset, SyntheticHand [16]. The NYU dataset has around 73K annotated frames as training data (single subject) and 8K frames as test data (two subjects). Each frame has been captured from 3 different viewpoints and the ground truth is fairly accurate. The MSRA dataset has 76K frames captured from 9 subjects, each in 17 pose categories. This dataset does not provide an explicit training/test split, so a subject-exclusive 9-fold cross validation is used for training and evaluation. Compared to NYU, the MSRA dataset has lower image resolution and less pose diversity, but accurate ground truth. The SyntheticHand dataset has over 700K training frames and 8K test frames of a single synthetic subject performing random poses from all viewpoints, thus being useful to analyze our methodology under occlusions. All three datasets provide at least 20 hand joints in common; however, the NYU dataset has 16 extra joints. We evaluate our approach using two metrics: average distance error in mm and success rate error [31]. Next, we detail the method parameters and evaluate our approach both quantitatively and qualitatively in comparison to state-of-the-art alternatives.

Figure 3:

Constraint derivatives during the training process (original NYU dataset). Estimated joints, along with the derivatives of the appearance and hand dynamics losses, are illustrated for the first five epochs of training. We qualitatively show how the proposed network converges in just a few epochs.

4.1 Training

We use the MatConvNet library [36] on a server with a GeForce GTX Titan X GPU with 12 GB of memory. We optimize the network using the stochastic gradient descent (SGD) algorithm. We report the hyper-parameters used for the NYU dataset: we set the batch size, learning rate, weight decay and momentum to 50, 0.5e-6, 0.0005 and 0.9, respectively. Our approach converges in about 6 epochs, after which we reduce the learning rate by a factor of 10 for two more epochs. Overall, training takes two days on the original NYU dataset, while testing runs at 50 fps.
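The reported schedule can be captured in a small configuration sketch (the epoch boundary follows the description above):

```python
# Hyper-parameters as reported for the NYU dataset.
config = {
    "batch_size": 50,
    "lr": 0.5e-6,
    "weight_decay": 0.0005,
    "momentum": 0.9,
}

def learning_rate(epoch, base_lr=0.5e-6):
    """Base rate for the first ~6 epochs, then dropped by a factor of 10
    for the remaining epochs."""
    return base_lr if epoch < 6 else base_lr / 10.0
```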

Figure 4: Baseline architectures. a) A single channel network with the same convolutional capacity as the branching network. We train this network with the same loss as in Eq. 1, omitting the corresponding loss term. b) A single channel network with the same capacity as one branch of the hierarchical model, branching into local poses on the FC layers. We train this network with the same loss as in Eq. 1, omitting the corresponding loss term.
Figure 5: Quantitative results comparing baselines on NYU dataset. a) Training process in terms of average error per epoch. b) Comparing palm: joints regression vs. viewpoint regression. c) Maximum error success rate comparing baselines. d) Per joint average error comparing baselines.

Loss function parameter tuning. We set a low value for the appearance weight in Eq. 6 since the appearance term behaves like a regularizer and is not connected to the ground truth. The dynamics loss is mainly a summation of cosine functions while the pose loss is in millimeters; therefore we set the dynamics weight higher in order to balance the cosine space against millimeters. Finally, we set the remaining parameters experimentally to 4, 4, 3, 20 and 0.0005, respectively. We show the derivatives of the appearance and dynamics loss functions for a number of joints during the first five epochs in Fig. 3, as well as qualitative images of the estimated joints.

Figure 6: Data augmentation. We generate new data by applying non-rigid hand shape deformation along with rigid transformations like in-plane rotation. Hand kinematic parameters are slightly deformed and new hand joints are used to interpolate hand point cloud. We also change palm and fingers size. Therefore, given a pose, different hand shapes can be generated which helps to generalize better to unseen subjects. First column shows the original images and others are generated samples.

4.2 Ablation study

In this section we study the different components of the proposed architecture trained on the NYU dataset. We denote each component by a number.

Locality. Locality refers to the number of joints in the network output. In the first case, we analyze the hierarchical network trained with just one finger in each branch and without the constraints and fusion network (so-called 1:local). This network has a high locality value. As one can expect, it can easily overfit on the training data and exchange estimations between similar fingers. We obtain a significant improvement by decreasing locality, i.e. by including the palm joints in each branch (so-called 2:1+palm). Palm joints are located in a near-planar space and thus do not add high non-linearity to the output of each branch, while helping better finger localization. We compare these methods in Fig. 5 (red vs. green lines).

Constraints. We train method 2:1+palm including the constraints in the loss function (Eq. 1), without the fusion term at this stage (so-called 3:2+constraint). We still do not explicitly model any relationship among fingers in the output space, but let the network learn each finger's joints with respect to the hand surface and finger dynamics. In Fig. 5 we show the effectiveness of this strategy (magenta line) against method 2:1+palm. We also analyze the effect of the constraints on the training process in Fig. 5. As can be seen, by applying the proposed constraints, method 3:2+constraint is more robust against overfitting than method 1. The validation error of method 3 does not change significantly from epoch 7 to 15. Comparing both methods at epoch 20, method 1 has a lower training error while its validation error is almost 1.5 times that of method 3.

Branching strategy vs. single channel architecture. As baselines, we created two single channel networks with 6 convolutional layers, as shown in Fig. 4. The output of the first network (the so-called single-channel network) is the 3D locations of the full set of joints. In this architecture, the capacity of the convolutional layers is kept similar to the whole branching network, and the network is trained with the loss function without the local pose term. The outputs of the second network (the so-called FC-branching network) are similar to those of method 3:2+constraint. The capacity of the convolutional layers in this architecture is similar to one branch of the tree-structured network, and the branching is applied on the FC layers. We train this network with the same loss as method 3:2+constraint, and both baselines with the same hyper-parameters introduced in Sec. 4.1. As one can see in Fig. 5, the single-channel network (dashed magenta line) performs worse than method 3:2+constraint, showing the effectiveness of the tree-structured network. This means that, regardless of the capacity of the network, in a single channel network the backpropagated gradients of the loss are not able to train the network filters to map the input image to a highly non-linear space in an optimal and generalizable way. This is even worse for the FC-branching network (dashed dark brown line).

Palm viewpoint vs. palm joints regression. We evaluate our palm joints regression vs. palm viewpoint regression in terms of success rate error in Fig. 5. The palm viewpoint regressor outputs the rotation as a quaternion; we convert it to a rotation matrix and use it to transform a predefined reference palm. As can be seen in the figure, palm viewpoint regression significantly reduces the palm joints error.
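The quaternion-to-rotation-matrix step can be sketched as follows; the reference palm coordinates below are hypothetical placeholders, not the actual reference example used in the paper.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)  # normalize defensively
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

# Rotate a (hypothetical) reference palm, one row of 3D coordinates per joint.
reference_palm = np.array([[0., 0., 0.], [30., 0., 0.], [0., 40., 0.]])
q = np.array([np.cos(np.pi / 4), 0., 0., np.sin(np.pi / 4)])  # 90 deg about z
palm = reference_palm @ quat_to_rotmat(q).T
```

The regressed quaternion thus places all palm joints at once, which is what makes the viewpoint regressor a compact alternative to regressing each palm joint independently.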

Global vs. local pose. We add the fusion network to method 3:2+constraint to model correlations among different local poses in an explicit way (so-called 4:3+viewpoint+fusion). We include the viewpoint regression features in the fusion as well. We illustrate the results in Fig. 5 (dashed blue line). Compared to method 3:2+constraint, method 4:3+viewpoint+fusion improves performance for error thresholds below 30mm.

Per joint mean error. We also illustrate the per joint mean error in Fig. 5. As expected, the most local solution (method 1) performs worst among the baselines. Comparing methods 2 and 3:2+constraint in average error also shows the benefit of applying the constraints in the loss. By including viewpoint features in the fusion network, method 4:3+viewpoint+fusion considerably reduces the palm joints mean error. Although method 4:3+viewpoint+fusion performs better for the pinky and ring fingertips, it does not achieve the best results for the index and thumb fingertips.

Data augmentation. Data augmentation is a common approach to boost CNN models with small deformations of the images. Commonly used augmentations are rotation, scaling, stretching, and adding random noise to pixels. Such approaches are either rigid (rotation and scaling) or unrealistic (stretching). Here, we propose a realistic non-rigid data augmentation. As a first step, we remove redundant data by checking the ground truth joints: an image is redundant if it is highly similar to at least one other image in the training set, where similarity is measured by the maximum Euclidean distance among corresponding joints. Two images are considered similar if this distance is below a threshold, set to 10 mm for this task.
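The redundancy filter described above can be sketched as a greedy pass over the training set; the exact selection order used in the paper is not specified, so this greedy variant is an assumption.

```python
import numpy as np

def filter_redundant(poses, thresh=10.0):
    """Greedily keep a pose only if its maximum per-joint Euclidean
    distance to every already-kept pose is at least `thresh` (mm).

    poses: (N, J, 3) ground-truth joints, one row of J joints per image.
    Returns the indices of the kept (non-redundant) images.
    """
    kept = []
    for i, p in enumerate(poses):
        redundant = any(
            np.max(np.linalg.norm(p - poses[k], axis=1)) < thresh
            for k in kept
        )
        if not redundant:
            kept.append(i)
    return kept
```

With the 10 mm threshold, two frames count as duplicates only when every corresponding joint pair lies within 10 mm, matching the similarity criterion above.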

Our data augmentation consists of in-plane rotation, changing the palm and finger sizes, and deforming the finger poses. We show some generated images in Fig. 6. In the following we explain the details of the augmentation.

Figure 7: Auxiliary points ’*’ are added to the set of joints to avoid unrealistic warping in non-rigid hand augmentation.

The main idea of non-rigid hand deformation is to deform the ground truth hand joints and interpolate the point cloud based on the new joints. We use thin plate splines (TPS) [3] as a standard interpolation technique to deform the point cloud. However, to avoid extrapolation problems and unrealistic warping, we add auxiliary points to the set of joints. We show some possible auxiliary points in Fig. 7: we mainly add points around the wrist and thumb. We observed unrealistic deformation around the thumb, and by adding three fixed points there we avoided extrapolation problems. In the wrist case, we do not want to deform points of the lower arm; fixed auxiliary points around the wrist constrain the space and avoid unrealistic warping.
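A minimal 2D TPS warp in the style of Bookstein [3] can be sketched as below; the paper warps a 3D point cloud, so this 2D version is a simplification for illustration. Note how auxiliary fixed points are simply control points whose source and destination coincide.

```python
import numpy as np

def tps_warp(ctrl_src, ctrl_dst, points):
    """Thin plate spline mapping 2D control points ctrl_src -> ctrl_dst,
    applied to `points`. Fixed auxiliary points are control points with
    identical source and destination coordinates."""
    def U(r):
        # TPS radial kernel U(r) = r^2 log r, with U(0) = 0.
        with np.errstate(divide="ignore", invalid="ignore"):
            return np.where(r > 0, r * r * np.log(r), 0.0)

    n = len(ctrl_src)
    K = U(np.linalg.norm(ctrl_src[:, None] - ctrl_src[None, :], axis=2))
    P = np.hstack([np.ones((n, 1)), ctrl_src])  # affine part [1, x, y]
    L = np.zeros((n + 3, n + 3))
    L[:n, :n], L[:n, n:], L[n:, :n] = K, P, P.T
    Y = np.vstack([ctrl_dst, np.zeros((3, 2))])
    params = np.linalg.solve(L, Y)              # kernel weights + affine

    Kx = U(np.linalg.norm(points[:, None] - ctrl_src[None, :], axis=2))
    Px = np.hstack([np.ones((len(points), 1)), points])
    return Kx @ params[:n] + Px @ params[n:]
```

Because TPS interpolates exactly at the control points, the deformed cloud follows the new joint positions while the fixed auxiliary points pin down the wrist and thumb regions.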

A first possible shape deformation is changing the hand scale. However, a simple global scaling does not guarantee generalization to unseen subjects. Instead, we change the sizes of the fingers and palm, as shown in the 4th row of Fig. 6. As a first step, we compute the hand kinematic parameters in a hand coordinate frame. This frame is defined by the palm joints such that, in a fairly open hand, the thumb defines one axis, the other fingers define the second axis, and the third axis is perpendicular to the palm plane. The palm can then be stretched along the first two directions, each by a random factor. Keeping the kinematic parameters fixed, we can randomly modify the finger lengths and reconstruct new joints for each finger. It is also possible to slightly modify the kinematic parameters and reconstruct the joints in a new pose. However, we keep the kinematic parameters near their original values to avoid unrealistic point cloud deformations and possible large holes in the depth image. Finally, we apply morphological operations to fill small gaps. (Code will be publicly available after publication.)
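The finger-length modification above can be sketched as rescaling each bone along the kinematic chain while keeping the bone directions, i.e. the pose, fixed; the paper works with full kinematic parameters, so this direct rescaling is a simplified assumption.

```python
import numpy as np

def scale_finger(joints, factor):
    """Rescale bone lengths along a finger's kinematic chain while
    keeping bone directions (i.e. the pose) fixed.

    joints: (K, 3) chain from finger base to tip; joints[0] stays put.
    """
    out = joints.copy()
    for i in range(1, len(joints)):
        bone = joints[i] - joints[i - 1]   # original bone vector
        out[i] = out[i - 1] + factor * bone
    return out
```

Applying a random `factor` per finger (and an analogous anisotropic stretch to the palm joints) yields the new joint set that drives the TPS interpolation of the point cloud.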

In the NYU dataset, around 60K images remained after removing redundant images from the 218K samples of the training set (including all cameras). We then generated two sets of augmented images of around 780K and 1500K samples. We use random scaling factors within a fixed range for the palm and fingers, and perturb the kinematic parameters by random offsets within a fixed range of degrees. The only difference between the two generated sets is the range of in-plane rotation degrees, which is wider in the second set.

We compare the results on both generated sets in Fig. 5 (brown and dark green lines). We train method 4:3+viewpoint+fusion on these two new sets, yielding method 5:4+aug1 and method 6:4+aug2, respectively. The model trained on the set with more samples and a wider range of in-plane rotations (method 6:4+aug2) generalizes better to the test set, and a significant improvement is achieved compared to the original data (method 4:3+viewpoint+fusion). We observed that the wrist joint had the maximum error in 20% of the cases in method 6:4+aug2. Therefore, we replaced the estimated palm joints of method 6:4+aug2 with the palm joints estimated by the viewpoint regressor, which slightly improved performance (final method 7:6+palm). We also show the per joint mean error of method 7:6+palm in Fig. 5. Fingertips have the highest error among the joints; data augmentation helps to significantly improve fingertip estimation, as can be seen by comparing the different baselines qualitatively in Fig. 9.

Figure 8: State-of-the-art comparison. a) and b) Mean and maximum error success rate on NYU dataset. c) Maximum error success rate on MSRA dataset. d) Mean error success rate on SyntheticHand dataset.

4.3 Comparison with state of the art

We report the performance of our final model compared to state-of-the-art data-driven approaches [33], [21], [18], [26], [9], [10], [40] and [19] on the NYU dataset. On the MSRA dataset we compare to [28], [15], [37], [9] and [19]. Finally, we compare to [20] and [16] on the SyntheticHand dataset.

NYU dataset. The works included in the comparison use 14 joints (as proposed in [33]) on the NYU dataset. For a fair comparison on this dataset, we take the 11 of our 20 joints most similar to those of [33]. We show maximum error success rate results in Fig. 8. As one can see, we outperform state-of-the-art results, although [10] and [19] perform slightly better for error thresholds below 13mm. We also illustrate the average error success rate in Fig. 8, which shows our method performs well on average for a majority of frames, i.e. less than 10mm error for 60% of the test set. We compare to the state of the art in terms of overall mean error in Table 1. All these results show a significant improvement when using data augmentation.
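The maximum-error success rate used throughout Figs. 5 and 8 can be sketched as follows (function name and array shapes are ours, for illustration):

```python
import numpy as np

def success_rate(pred, gt, thresholds):
    """Fraction of frames whose maximum per-joint 3D error is below
    each threshold (the 'maximum error success rate' of Fig. 8).

    pred, gt: (N, J, 3) joint predictions and ground truth in mm.
    """
    # Worst-case joint error per frame, shape (N,).
    max_err = np.max(np.linalg.norm(pred - gt, axis=2), axis=1)
    return np.array([np.mean(max_err < t) for t in thresholds])
```

A frame counts as a success only if all of its joints are within the threshold, which makes this a stricter metric than the mean-error success rate also reported in Fig. 8.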

Method | Average 3D error (mm)
Oberweger et al. [20] (DeepPrior) | 19.8
Oberweger et al. [21] (Feedback) | 16.2
Neverova et al. [18] | 14.9
Guo et al. [10] (Ren) | 13.4
Oberweger et al. [19] (DeepPrior++) | 12.3
Ours (4:3+viewpoint+fusion) | 15.6
Ours (final) | 11.0
Table 1: Average 3D error on NYU dataset.
Method | Average 3D error (mm)
Sun et al. [28] | 15.2
Wan et al. [37] (CrossingNet) | 12.2
Oberweger et al. [19] (DeepPrior++) | 9.5
Ours (4:3+viewpoint+fusion) | 12.9
Ours (final) | 9.7
Table 2: Average 3D error on MSRA dataset.

MSRA dataset. We applied the introduced non-rigid hand augmentation in the same way as for the NYU dataset, but observed divergence during training; a possible reason is the accuracy of the ground truth annotations in the MSRA dataset. Therefore, we applied standard augmentation techniques instead, namely random scaling and rotation within fixed ranges. We show the maximum error success rate results in Fig. 8. As can be seen, our method slightly outperforms the compared methods for error thresholds between 13mm and 40mm. Although Sun et al. [28] has a higher number of good frames for errors below 11mm, it performs worst for higher error thresholds. Without data augmentation, our method (dashed blue line) performs slightly worse than [15]. Note that [15] uses a pre-alignment of samples based on the eigenvectors of the hand point cloud, which can be considered a kind of augmentation. We also show the average error in Table 2: with standard augmentation, our method performs comparably to [19] on this dataset. Note that [19] also uses random translation in its augmentation. We show some qualitative results in Fig. 9. As one can see, the ground truth annotations are inaccurate in some cases, most notably for the thumb.

SyntheticHand dataset. We use the original training set without augmentation to train our model on this dataset. The model converges in 7 epochs. The mean error success rate is shown in Fig. 8. As can be seen, our method performs well on this dataset even for complex poses and viewpoints. Some qualitative results are shown in Fig. 9. The overall average error on this dataset is 3.94mm.

Figure 9: Qualitative results. a) NYU dataset. Rows from top to bottom: depth image, ground truth, single-channel network, method 4:3+viewpoint+fusion, and final solution. The last three columns show maximum errors higher than 50mm. b) MSRA dataset. Rows from top to bottom: depth image, ground truth, and final solution. The last three columns show maximum errors higher than 50mm. The annotations reveal inaccurate ground truth for this dataset, especially for the thumb; the 12th column has an inaccurate ground truth for the little finger. c) SyntheticHand dataset. Rows from top to bottom: depth image, ground truth, and final solution. See text for details of the methods.

5 Conclusions

We proposed a novel hierarchical tree-like structured CNN for recovering hand poses from depth maps. In this structure, branches are trained to become specialized in predefined subsets of the hand joints. We fused the learned local features in a fusion network to model higher order dependencies among joints, training the whole network end-to-end. By including a new loss function incorporating appearance and physical constraints on doable hand motion and deformation, our network increases the precision of the final hand pose estimations on quite challenging datasets. In particular, we found the fusion network helps to better localize joints for easier hand configurations, while behaving similarly to a local solution in more complex cases. We improved the palm joints by applying a viewpoint regressor and fusing its learned features into the global pose. Finally, we introduced a non-rigid hand augmentation technique that deforms the original hands in shape and pose, helping to generalize better to unseen data. As a result, we significantly improved estimations on the original NYU dataset by 4.6mm on average. As future work, we will consider more complex data augmentation techniques to cope with noise in the depth image; realistic data can also be combined with synthetic data. In this sense, we will work on filling gaps realistically when more complex pose deformations are applied in the augmentation.


This work has been partially supported by the Spanish projects TIN2015-65464-R and TIN2016-74946-P (MINECO/FEDER, UE) and CERCA Programme / Generalitat de Catalunya. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.


  • [1] I. Akhter and M. J. Black. Pose-conditioned joint angle limits for 3d human pose reconstruction. In CVPR, pages 1446–1455, 2015.
  • [2] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab. Robust optimization for deep regression. In ICCV, pages 2830–2838, 2015.
  • [3] F. L. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Transactions on pattern analysis and machine intelligence, 11(6):567–585, 1989.
  • [4] C. Choi, A. Sinha, J. H. Choi, S. Jang, and K. Ramani. A collaborative filtering approach to real-time hand pose estimation. ICCV, 2015.
  • [5] M. de La Gorce, D. J. Fleet, and N. Paragios. Model-based 3d hand pose estimation from monocular video. PAMI, 33(9), 2011.
  • [6] A. Dosovitskiy, J. Tobias Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. pages 1538–1546, 2015.
  • [7] A. Erol, G. Bebis, M. Nicolescu, R. D. Boyle, and X. Twombly. Vision-based hand pose estimation: A review. Computer Vision and Image Understanding, 108(1–2):52 – 73, 2007.
  • [8] X. Fan, K. Zheng, Y. Lin, and S. Wang. Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation. pages 1347–1355, 2015.
  • [9] L. Ge, H. Liang, J. Yuan, and D. Thalmann. 3d convolutional neural networks for efficient and robust hand pose estimation from single depth images. In CVPR, 2017.
  • [10] H. Guo, G. Wang, X. Chen, C. Zhang, F. Qiao, and H. Yang. Region ensemble network: Improving convolutional network for hand pose estimation. ICIP, 2017.
  • [11] J. S. Supancic III, G. Rogez, Y. Yang, J. Shotton, and D. Ramanan. Depth-based hand pose estimation: methods, data, and challenges. arXiv:1504.06378v1, 2015.
  • [12] C. Keskin, F. Kirac, Y. Kara, and L. Akarun. Real time hand pose estimation using depth sensors. ICCV Workshops, pages 1228–1234, 2011.
  • [13] F. Kirac, Y. E. Kara, and L. Akarun. Hierarchically constrained 3d hand pose estimation using regression forests from single frame depth data. Pattern Recognition Letters, 50(0):91 – 100, 2014.
  • [14] S. Li, W. Zhang, and A. B. Chan. Maximum-margin structured learning with deep networks for 3d human pose estimation. ICCV, 2015.
  • [15] J. Y. Liuhao Ge, Hui Liang and D. Thalmann. Robust 3d hand pose estimation in single depth images: from single-view cnn to multi-view cnns. CVPR, 2016.
  • [16] M. Madadi, S. Escalera, A. Carruesco, C. Andujar, X. Baró, and J. Gonzàlez. Occlusion aware hand pose recovery from sequences of depth images. IEEE International Conference on Automatic Face and Gesture Recognition, 2017.
  • [17] A. Makris, N. Kyriazis, and A. Argyros. Hierarchical particle filtering for 3d hand tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 8–17, 2015.
  • [18] N. Neverova, C. Wolf, F. Nebout, and G. Taylor. Hand pose estimation through semi-supervised and weakly-supervised learning. CVIU, 2017.
  • [19] M. Oberweger and V. Lepetit. Deepprior++: Improving fast and accurate 3d hand pose estimation. In ICCVW, 2017.
  • [20] M. Oberweger, P. Wohlhart, and V. Lepetit. Hands deep in deep learning for hand pose estimation. Computer Vision Winter Workshop, 2015.
  • [21] M. Oberweger, P. Wohlhart, and V. Lepetit. Training a feedback loop for hand pose estimation. ICCV, 2015.
  • [22] I. Oikonomidis, N. Kyriazis, and A. Argyros. Efficient model-based 3d tracking of hand articulations using kinect. BMVC, pages 101.1–101.11, 2011.
  • [23] C. Qian, X. Sun, Y. Wei, X. Tang, and J. Sun. Realtime and robust hand tracking from depth. CVPR, 2014.
  • [24] S. S. Rautaray and A. Agrawal. Interaction with virtual game through hand gesture recognition. In Multimedia, Signal Processing and Communication Technologies (IMPACT), 2011 International Conference on, pages 244–247. IEEE, 2011.
  • [25] T. Sharp, C. Keskin, D. Robertson, J. Taylor, J. Shotton, D. Leichter, K. Christoph, R. Ido, A. V. Y. Wei, D. F. P. K. E. Krupka, A. Fitzgibbon, and S. Izadi. Accurate, robust, and flexible real-time hand tracking. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 3633–3642. ACM, 2015.
  • [26] A. Sinha, C. Choi, and K. Ramani. Deephand: robust hand pose estimation by completing a matrix imputed with deep features. pages 4150–4158, 2016.
  • [27] S. Sridhar, A. Oulasvirta, and C. Theobalt. Interactive markerless articulated hand motion tracking using rgb and depth data. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013.
  • [28] X. Sun, Y. Wei, S. Liang, X. Tang, and J. Sun. Cascaded hand pose regression. In CVPR, 2015.
  • [29] D. J. Tan, T. Cashman, J. Taylor, A. Fitzgibbon, D. Tarlow, S. Khamis, S. Izadi, and J. Shotton. Fits like a glove: Rapid and reliable hand shape personalization. CVPR, 2016.
  • [30] D. Tang, H. Chang, A. Tejani, and T.-K. Kim. Latent regression forest: Structured estimation of 3d articulated hand posture. CVPR, 2014.
  • [31] J. Taylor, J. Shotton, T. Sharp, and A. Fitzgibbon. The vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation. pages 103–110, 2012.
  • [32] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. pages 648–656, 2015.
  • [33] J. Tompson, M. Stein, Y. LeCun, and K. Perlin. Real-time continuous pose recovery of human hands using convolutional networks. TOG, 33(5), 2014.
  • [34] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. NIPS, 2014.
  • [35] J. Usabiaga, A. Erol, G. Bebis, R. Boyle, and X. Twombly. Global hand pose estimation by multiple camera ellipse tracking. Machine Vision and Applications, 2009.
  • [36] A. Vedaldi and K. Lenc. Matconvnet – convolutional neural networks for matlab. In Proceeding of the ACM Int. Conf. on Multimedia, 2015.
  • [37] C. Wan, T. Probst, L. Van Gool, and A. Yao. Crossing nets: Dual generative models with a shared latent space for hand pose estimation. CVPR, 2017.
  • [38] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. CVPR, 2016.
  • [39] C. Xu and L. Cheng. Efficient hand pose estimation from a single depth image. ICCV, pages 3456–3462, 2013.
  • [40] C. Xu, L. N. Govindarajan, Y. Zhang, and L. Cheng. Lie-x: Depth image based articulated object pose estimation, tracking, and action recognition on lie groups. International Journal of Computer Vision, 123(3):454–478, 2017.