Hand pose estimation from a single depth image lays the foundation for user interaction techniques on head-mounted Augmented Reality (AR) devices, e.g., the Microsoft HoloLens and Magic Leap One. It allows users to provide input to these devices efficiently. Despite recent remarkable progress, the problem remains unsolved because of large pose variation, intra-class similarities between finger joints, and occlusion.
Recently, deep learning has become popular in the computer vision community and also achieves state-of-the-art results on 3D hand pose estimation tasks. These methods can be roughly classified into two categories. The first category treats the input depth image as a grayscale image and applies a 2D convolutional neural network directly to it; representative methods are A2J [Xiong et al.2019] and V2V-PoseNet [Moon et al.2018]. The second category uses 3D information: these methods convert depth images into 3D voxels [Ge et al.2017] or point clouds [Ge et al.2018], followed by a 3D CNN or a PointNet respectively.
These neural networks are trained with hand regions extracted from depth images. However, the extracted hand regions strongly influence the accuracy of pose estimation. For example, in [Sinha et al.2016], the user wears a colorful wristband which is used to determine the hand region, as shown in Figure 1 (A). This is impractical in real use cases. In [Chen et al.2019], hand regions are initialized using a shallow CNN, but this can include arms or other foreground regions (Figure 1 (B)). In [Wan et al.2018], hand regions are obtained from ground truth, which is not available in real applications (Figure 1 (C)).
In addition, these deep-learning-based methods are effective only if a large amount of training data is available. The data is usually collected and labeled manually, which is tedious and time consuming. This problem is even worse for 3D computer vision problems, which require labeling in 3D, a task that is more difficult for humans. Many recent works therefore focus on using computer graphics methods to synthesize images [Rad et al.2018, Malik et al.2018]. However, the resulting performance is usually suboptimal because synthetic images do not correspond exactly to real images.
In this paper, we propose HandAugment, a method to synthesize image data to augment the training of hand pose estimation methods. First, we propose a two-stage neural network scheme to tackle hand region extraction; this scheme gradually localizes the hand region in the input depth map. Second, we propose a data synthesis method based on MANO [Romero et al.2017]. Because synthetic images do not correspond exactly to real data, we combine real data with synthetic images. Finally, we apply HandAugment to different datasets, and the experiments show that our method greatly improves performance and achieves state-of-the-art results.
2 Related Work
In this section we review work related to our proposed method, including depth-based 3D hand pose estimation and data augmentation methods.
2.1 Depth-Based 3D Hand Pose Estimation
Hand pose estimation has been explored widely using various techniques in the field of computer vision. Neural-network-based hand pose estimation and gesture recognition approaches using depth images are reviewed as follows.
The goal of hand pose estimation is to estimate the 3D locations of hand joints from one or more frames recorded by a depth camera. The neural network based methods can be roughly classified into two categories: 2D CNNs and 3D CNNs. The 2D CNNs estimate hand pose directly from depth images; representative methods include a cascaded multistage method [Chen et al.2019], a structure-aware regression approach [Taylor et al.2016], and hierarchical tree-like structured CNNs [Madadi et al.2017]. These methods treat depth maps as 2D images at the input of the network and are therefore unable to fully capture the 3D information. The 3D CNNs convert 2D depth images into a 3D data structure, such as 3D voxel grids [Moon et al.2018] or D-TSDF volumes [Ge et al.2017], and they produce state-of-the-art results.
2.2 Data Augmentation Method
Data augmentation is a strategy to significantly increase the diversity of training data without collecting additional data. Techniques such as cropping, padding and flipping are commonly used in deep learning. This strategy improves the performance of data-driven tasks such as object recognition and hand pose estimation, and has been widely used in recent work [Xiong et al.2019, Yang et al.2019, Guo et al.2017, Oberweger and Lepetit2017]. Most of these data augmentation methods use image transformations, including in-plane translation, rotation, scaling and mirroring. Specifically, for color-based methods, training images can be augmented by adjusting the hue channel of the color images [Yang et al.2019]. For depth-based methods, images can be augmented by applying 3D transformations; for example, [Ge et al.2017] randomly rotates and stretches the 3D point cloud to synthesize training data.
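As a concrete illustration, a minimal in-plane rotation augmentation for a depth image and its 2D joint annotations might look like the following sketch; the function and variable names are ours, not from any cited implementation:

```python
import numpy as np

def augment_in_plane(depth, joints_uvd, angle_deg, center):
    """Rotate a depth image and its 2D joint annotations in-plane.

    `depth` is an HxW array, `joints_uvd` an (N, 3) array of (u, v, d)
    joint locations; names and layout are illustrative assumptions.
    """
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    cx, cy = center
    # Rotate joint pixel coordinates about the crop center; depth stays fixed.
    uv = joints_uvd[:, :2] - np.array([cx, cy])
    out = joints_uvd.copy()
    out[:, :2] = uv @ R.T + np.array([cx, cy])
    # Resample the image via the inverse map (nearest neighbour keeps
    # depth values valid, unlike interpolation across depth edges).
    H, W = depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([xs - cx, ys - cy], axis=-1) @ R  # inverse rotation
    src_x = np.clip(np.round(coords[..., 0] + cx).astype(int), 0, W - 1)
    src_y = np.clip(np.round(coords[..., 1] + cy).astype(int), 0, H - 1)
    return depth[src_y, src_x], out
```

Scaling and mirroring follow the same pattern: transform the joint coordinates forward and resample the image through the inverse map.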
We first give an overview of HandAugment in Section 3.1. In Section 3.2 we present the two-stage network scheme and its architecture, and in Section 3.3 we describe how to create pose-guided hand augmentation. Finally, the implementation details are given in Section 3.4.
Given a depth image, the task of hand pose estimation is to estimate the 3D locations of the hand joints. We use a two-stage neural network scheme to estimate hand poses, as illustrated in Figure 2. We first feed the input depth image into the first-stage network, which estimates an initial hand pose. This initial hand pose is used to extract an augmented hand region from the input depth image. Finally, this augmented hand region is fed into the second-stage network to estimate the final hand pose.
3.2 Two-Stage Network Scheme
We use a two-stage neural network scheme to estimate hand poses. The input to the first-stage network is a patch extracted from the input depth image, whose center and size are determined from the MCP joints. The depth values in the extracted patch are truncated and then normalized, where the truncation threshold is empirically set to 300 mm. We then feed this patch into the first-stage network to obtain the initial pose. To get the augmented hand patch, we first find the per-axis maximum and minimum values of the initial pose: $x_{max}$, $x_{min}$, $y_{max}$, $y_{min}$, $z_{max}$, $z_{min}$. Any point outside the range $[x_{min}-\epsilon_x, x_{max}+\epsilon_x]$, $[y_{min}-\epsilon_y, y_{max}+\epsilon_y]$, $[z_{min}-\epsilon_z, z_{max}+\epsilon_z]$ is removed, where $\epsilon_x$ and $\epsilon_y$ are set to 30 mm and $\epsilon_z$ is set to 20 mm. This new patch is then fed into the second-stage network to obtain the final hand pose. This process is illustrated in Figure 3.
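The patch augmentation step described above can be sketched as follows; the function name, array layout (points and joints as N x 3 arrays in millimetres) and margin parameter names are our assumptions:

```python
import numpy as np

def extract_augmented_patch(points, init_pose, margin_xy=30.0, margin_z=20.0):
    """Keep only the 3D points inside the initial-pose bounding box,
    enlarged by fixed margins (30 mm in x/y, 20 mm in z, as in the text).

    `points` is an (N, 3) array of candidate 3D points and `init_pose`
    a (J, 3) array of initial joint estimates, both in millimetres.
    """
    lo = init_pose.min(axis=0) - np.array([margin_xy, margin_xy, margin_z])
    hi = init_pose.max(axis=0) + np.array([margin_xy, margin_xy, margin_z])
    keep = np.all((points >= lo) & (points <= hi), axis=1)
    return points[keep]
```

The enlarged box discards arm and background pixels that survived the first crop while keeping the full hand even when the initial pose is slightly off.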
The architectures of our two networks are based on EfficientNet-B0 [Tan and Le2019]. We give the architecture of our modified EfficientNet-B0 in Table 1. The input to both networks is an image patch cropped from the input depth image and resized to a fixed resolution before being fed into the network. The output of both networks is a 63-dimensional vector indicating the 3D locations of the 21 hand joints. Apart from the input and output, the rest of the architecture is the same as the original EfficientNet-B0.
(Table 1, final stage 9: Conv1x1 & Pooling & FC, 63 output channels, 1 layer)
We train both networks with a Wing Loss [Feng et al.2018], because the Wing Loss is robust to both small and large pose deviations. Given an estimated position $\hat{y}_i$ of the $i$-th joint and its corresponding ground truth $y_i$, with the deviation $x = \lVert \hat{y}_i - y_i \rVert$, the Wing Loss is defined as:

$$\mathrm{wing}(x) = \begin{cases} w \ln\left(1 + x/\epsilon\right) & \text{if } x < w, \\ x - C & \text{otherwise,} \end{cases}$$

where $w$ controls the width of the non-linear part to be within $[-w, w]$, $\epsilon$ limits the curvature of the non-linear part, and $C = w - w\ln(1 + w/\epsilon)$ links the linear and non-linear parts together. The parameters $w$ and $\epsilon$ are set empirically.
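A minimal NumPy version of the Wing Loss described above, with placeholder values for $w$ and $\epsilon$ (the source does not state the exact values used), could look like:

```python
import numpy as np

def wing_loss(pred, target, w=10.0, eps=2.0):
    """Wing loss of Feng et al. 2018, applied elementwise to joint errors.

    w and eps are illustrative defaults, not values from the paper.
    The loss is logarithmic for small errors and linear for large ones.
    """
    x = np.abs(pred - target)
    # C shifts the linear branch so the two pieces meet at x == w.
    C = w - w * np.log(1.0 + w / eps)
    loss = np.where(x < w, w * np.log(1.0 + x / eps), x - C)
    return loss.mean()
```

The offset $C$ makes the piecewise definition continuous, so gradients do not jump at the branch boundary.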
3.3 Data Augmentation
Our method is based on MANO [Romero et al.2017] to synthesize training data. MANO renders a depth image containing a right hand from three parameters: a camera parameter, a hand pose parameter and a shape parameter. The camera parameter is an 8-dimensional vector comprising scale, translation along the three camera axes, and global rotation (as a quaternion). The hand pose parameter is a 45-dimensional vector, and the shape parameter is a 10-dimensional vector. To obtain MANO parameters, we use the HANDS19 dataset, which uses gradient-based optimization [Baek et al.2019] to estimate MANO parameters from real images (Figure 4 (a)). We then use these estimated MANO parameters to synthesize images (Figure 4 (b)).
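For clarity, the parameter layout described above can be captured in a small sketch; the field names are illustrative, and only the dimensionalities (8, 45 and 10) come from the text:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ManoParams:
    """Parameter layout for the MANO-based renderer described in the text.

    camera: scale (1) + translation (3) + quaternion rotation (4) = 8 dims.
    pose: 45-dimensional articulation vector.
    shape: 10-dimensional shape vector.
    """
    camera: np.ndarray = field(default_factory=lambda: np.zeros(8))
    pose: np.ndarray = field(default_factory=lambda: np.zeros(45))
    shape: np.ndarray = field(default_factory=lambda: np.zeros(10))

    def as_vector(self) -> np.ndarray:
        # Flat concatenation, e.g. for feeding an optimizer.
        return np.concatenate([self.camera, self.pose, self.shape])
```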
We have four strategies to prepare training data. The first two are to use the real data and the original MANO parameters provided by HANDS19 directly, without adding, removing or modifying anything. Each of these two datasets contains 170K images. We refer to them as the Real Dataset (RD) and the Synthetic Dataset (SD) respectively.
Third, because the distributions of SD and RD differ, we propose a linear blending method to combine synthetic images with real images, as shown in Figure 4. Given a synthetic image $I_s$ and its corresponding real image $I_r$, the mixed image is given by:

$$I_m = \alpha I_s + (1 - \alpha) I_r,$$

where $\alpha \in [0, 1]$ is the blending weight.
We create 170K mixed synthetic images in total. This dataset is denoted the Mixed synthetic Dataset (MD).
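The linear blending step can be sketched as follows, assuming the images are depth arrays of equal shape; the blending weight of 0.5 is an illustrative default, not a value from the paper:

```python
import numpy as np

def blend_depth(synthetic, real, alpha=0.5):
    """Linearly blend a synthetic depth image with its corresponding real
    image. alpha=1.0 returns the pure synthetic image, alpha=0.0 the real
    one; intermediate values interpolate between the two distributions.
    """
    return alpha * synthetic + (1.0 - alpha) * real
```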
Lastly, in order to generate more training data, we create new MANO parameters by adding Gaussian noise to the original camera, pose and shape parameters provided by HANDS19. The Gaussian distribution is obtained by assuming that each dimension of the three parameters is independent and that the noise follows the same distribution as the original data. To generate the data, we add noise to one of the three parameters or to all of them. In total we create 400K images: 100K for camera parameters, 100K for hand pose (articulation) parameters, 100K for shape parameters and 100K for all three parameters together. We denote this dataset the Noised synthetic Dataset (ND).
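The noise model described above (per-dimension independent Gaussian noise whose spread matches the fitted parameters) can be sketched as; the function name and the `scale` knob are our additions:

```python
import numpy as np

def add_fitted_noise(params, rng, scale=1.0):
    """Perturb a set of fitted MANO parameters with Gaussian noise.

    `params` is an (N, D) array of parameter vectors (e.g. the 45-dim pose
    vectors of all training samples). Each dimension is treated as
    independent and the noise std is estimated from the data itself, so
    the perturbed samples stay within the observed parameter spread.
    """
    std = params.std(axis=0)                       # per-dimension spread
    noise = rng.normal(0.0, 1.0, params.shape) * std * scale
    return params + noise
```

Applying this to only one of the three parameter groups at a time, or to all of them, yields the four 100K subsets of ND.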
3.4 Implementation Details
We train our networks on a workstation equipped with an Intel Xeon Platinum 8160 CPU and two NVIDIA GeForce RTX 2080 Ti GPUs. We implement the networks in PyTorch. For training the first-stage network, the batch size is set to 128, Adamax [Kingma and Ba2014] is used as the optimizer, and a step-wise learning rate scheduler is applied. The first-stage network is trained using all the training data, including RD, SD, MD and ND, 640K images in total. The second-stage network is then fine-tuned from the first-stage network with the same batch size, optimizer and learning rate schedule, and is trained using RD and SD.
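The step-wise learning rate schedule mentioned above can be sketched as follows; the step size and decay factor are illustrative, since the paper does not state them:

```python
def stepwise_lr(base_lr, epoch, step_size=10, gamma=0.1):
    """Step-wise learning-rate schedule: multiply base_lr by gamma once
    every step_size epochs (the behaviour of a standard StepLR scheduler).
    step_size and gamma are placeholder values.
    """
    return base_lr * (gamma ** (epoch // step_size))
```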
We first introduce the datasets and evaluation metrics in the experiments. Afterwards we discuss the contributions of different components of our method.
HANDS19 Dataset [Hands2019] This dataset was sampled from BigHand2.2M [Yuan et al.2017]. The training set contains 175K images from 5 different subjects; some hand articulations and viewpoints are strategically excluded. The test set contains 125K images whose subjects partially overlap with those of the training set, with exhaustive coverage of viewpoints and articulations. Images are captured with an Intel RealSense SR300 camera. The hand pose annotation contains 21 joints: 4 joints for each finger and 1 joint for the palm. The dataset's large variation in viewpoint, articulation and hand shape makes it rather challenging.
4.2 Experiment Results
We conduct several experiments to demonstrate the effectiveness of our method. First, we compare our method with the state-of-the-art method A2J [Xiong et al.2019] on the HANDS19 challenge dataset; the results are given in Table 2. By leveraging a large dataset, our method outperforms A2J. We believe that combining A2J with our method could further improve performance.
(Table 2: A2J [Xiong et al.2019], error 13.74 mm)
We then train the network using a single-stage scheme with RD only. Note that the single-stage scheme contains only one EfficientNet-B0 network; the results are given in Table 3. Our two-stage scheme improves performance whether or not the data augmentation method is used. Moreover, our augmentation method improves the performance of both the single-stage and two-stage schemes.
(Headers of Tables 3–6 list the model name, the network/method/data configuration, and the error in mm. Table 6 row: Model_2, Viewpoint + Articulation + Shape, 15.73 mm.)
Our two-stage scheme contains two neural networks, where the second-stage network is fine-tuned from the first-stage network. To show how fine-tuning improves performance, we give the results in Table 4. Note that these results are obtained using real data only. Our two-stage scheme without fine-tuning performs worse than the single-stage scheme. This is probably because the distributions of real and synthetic data differ: the amount of synthetic data (SD, MD and ND) is much larger than that of RD, so the pretrained network can be less well fitted to RD. We therefore fine-tune the second-stage network with reduced synthetic data (only RD and ND). The results show that this greatly improves performance.
We propose three strategies to synthesize training data, as introduced in Section 3.3. We train the neural network using different combinations of the three strategies; the results are given in Table 5. Note that these results are obtained with the single-stage scheme. The performance gradually improves as we add SD, MD and ND to the training data.
Finally, we show how the third data synthesis strategy influences performance. The results are given in Table 6. Note that these results are also obtained with the single-stage network. We add noise to one of the three parameters (camera, hand pose or shape) or to all three. The performance improves even when we add noise to only one parameter.
In this paper, we propose HandAugment, a method to synthesize image data to augment the training of hand pose estimation methods. First, we propose a two-stage neural network scheme to tackle hand region extraction; this scheme gradually localizes the hand regions in the input depth maps. Second, we propose a data synthesis method based on MANO, with three strategies to prepare training data: using the original MANO parameters, mixing real and synthetic data, and adding noise to synthetic parameters. Finally, we conduct several experiments demonstrating that HandAugment effectively improves performance and achieves state-of-the-art results compared to existing methods in the HANDS 2019 challenge.
- [Baek et al.2019] Seungryul Baek, Kwang In Kim, and Tae-Kyun Kim. Pushing the envelope for rgb-based dense 3d hand pose estimation via neural rendering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1067–1076, 2019.
- [Chen et al.2019] Xinghao Chen, Guijin Wang, Hengkai Guo, and Cairong Zhang. Pose guided structured region ensemble network for cascaded hand pose estimation. Neurocomputing, 2019.
- [Feng et al.2018] Zhen-Hua Feng, Josef Kittler, Muhammad Awais, Patrik Huber, and Xiao-Jun Wu. Wing loss for robust facial landmark localisation with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2235–2245, 2018.
- [Ge et al.2017] Liuhao Ge, Hui Liang, Junsong Yuan, and Daniel Thalmann. 3d convolutional neural networks for efficient and robust hand pose estimation from single depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1991–2000, 2017.
- [Ge et al.2018] Liuhao Ge, Yujun Cai, Junwu Weng, and Junsong Yuan. Hand pointnet: 3d hand pose estimation using point sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8417–8426, 2018.
- [Guo et al.2017] Hengkai Guo, Guijin Wang, Xinghao Chen, and Cairong Zhang. Towards good practices for deep 3d hand pose estimation. arXiv preprint arXiv:1707.07248, 2017.
- [Hands2019] Hands. Hands 2019 challenge. https://sites.google.com/view/hands2019/challenge, 2019.
- [Kingma and Ba2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [Madadi et al.2017] Meysam Madadi, Sergio Escalera, Xavier Baró, and Jordi Gonzalez. End-to-end global to local cnn learning for hand pose recovery in depth data. arXiv preprint arXiv:1705.09606, 2017.
- [Malik et al.2018] Jameel Malik, Ahmed Elhayek, Fabrizio Nunnari, Kiran Varanasi, Kiarash Tamaddon, Alexis Heloir, and Didier Stricker. Deephps: End-to-end estimation of 3d hand pose and shape by learning from synthetic depth. In 2018 International Conference on 3D Vision (3DV), pages 110–119. IEEE, 2018.
- [Moon et al.2018] Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. V2v-posenet: Voxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5079–5088, 2018.
- [Oberweger and Lepetit2017] Markus Oberweger and Vincent Lepetit. Deepprior++: Improving fast and accurate 3d hand pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 585–594, 2017.
- [Rad et al.2018] Mahdi Rad, Markus Oberweger, and Vincent Lepetit. Feature mapping for learning fast and accurate 3d pose inference from synthetic images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4663–4672, 2018.
- [Romero et al.2017] Javier Romero, Dimitrios Tzionas, and Michael J Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics (TOG), 36(6):245, 2017.
- [Sinha et al.2016] Ayan Sinha, Chiho Choi, and Karthik Ramani. Deephand: Robust hand pose estimation by completing a matrix imputed with deep features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4150–4158, 2016.
- [Tan and Le2019] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
- [Taylor et al.2016] Jonathan Taylor, Lucas Bordeaux, Thomas Cashman, Bob Corish, Cem Keskin, Toby Sharp, Eduardo Soto, David Sweeney, Julien Valentin, Benjamin Luff, et al. Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences. ACM Transactions on Graphics, 35(4):143, 2016.
- [Wan et al.2018] Chengde Wan, Thomas Probst, Luc Van Gool, and Angela Yao. Dense 3d regression for hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5147–5156, 2018.
- [Xiong et al.2019] Fu Xiong, Boshen Zhang, Yang Xiao, Zhiguo Cao, Taidong Yu, Joey Tianyi Zhou, and Junsong Yuan. A2j: Anchor-to-joint regression network for 3d articulated pose estimation from a single depth image. In Proceedings of the IEEE International Conference on Computer Vision, pages 793–802, 2019.
- [Yang et al.2019] Linlin Yang, Shile Li, Dongheui Lee, and Angela Yao. Aligning latent spaces for 3d hand pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2335–2343, 2019.
- [Yuan et al.2017] Shanxin Yuan, Qi Ye, Bjorn Stenger, Siddhant Jain, and Tae-Kyun Kim. Bighand2.2m benchmark: Hand pose dataset and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4866–4874, 2017.