Gesture recognition is an essential component of robust Human-Robot Interaction (HRI) [3, 8, 10, 26, 5, 19]. Since robot malfunctions may occur due to noisy sensor measurements or the operator's speech disabilities, an alternative technique has to be adopted to allow the agent to understand the human's intentions and, consequently, react accordingly [13]. However, gesture recognition is a challenging problem in the robotics community, as the hand occupies a small area compared to the human body. Moreover, it exhibits many degrees of freedom and high self-similarity, e.g., in the visual appearance of the finger joints. Further difficulties arise from occlusions, changing illumination, and camera noise. The timing required for a robot to interact with its operator constitutes another considerable challenge in HRI, since real-time responses are necessary for a reliable application. Skeleton-based methods [30, 9, 29, 24] generally follow two processing steps. At first, they detect hand landmarks, a task widely known as hand pose estimation, and subsequently they recognize gestures according to these points. Some approaches make use of depth cameras to locate the 3D hand landmarks [31, 32, 36, 11, 37]. Nevertheless, these modules are expensive for small and low-cost platforms. Therefore, it is necessary to study hand pose estimation using a monocular camera as the primary sensing modality.
Based on the technique they use to tackle the problem, modern methods are divided into top-down and bottom-up ones. The former first detect the hand and afterwards locate its landmarks. In contrast, bottom-up frameworks initially detect the landmarks and then cluster them to form a hand. Both strategies present advantages and disadvantages. Top-down methods run once on each hand bounding-box to localize landmarks, so their run-time increases linearly with the number of detected hands. Bottom-up approaches detect landmarks from the whole image but have a low recall rate because complex backgrounds cause false positives.
Furthermore, hand pose estimation pipelines are divided into two categories according to their output. Methods that utilize coordinate regression are placed in the first [44, 6, 34], whereas algorithms that extract heatmaps belong to the second [12, 43, 28, 4, 17, 35]. More specifically, approaches in the first category directly predict the landmark coordinates from images. However, this is not accurate and has limitations when encountering occlusions and large poses. Frameworks belonging to the second category infer a likelihood heatmap for each hand joint and then generate the joint coordinates from it. These techniques offer high accuracy; nevertheless, they execute more slowly, which makes them impractical for resource-constrained mobile platforms. This is mainly due to their network structure, which does not follow a lightweight architecture. To the best of our knowledge, there is no practical method that runs on embedded systems while generating accurate hand pose results in real-time.
Considering the aforementioned reasons, in this paper, we propose a lightweight, top-down pose estimation technique, dubbed “FastHand”. Our method first detects the hand using a monocular camera, afterwards crops the Region Of Interest (ROI), and then localizes its 2D landmarks based on heatmap regression (Fig. 1). An encoder-decoder network design is employed. In the encoder, a low number of parameters is maintained while learning high-level information by using convolution, skip connections [16], and depthwise separable convolutions [7]. In the decoder, the feature map resolution is increased through a deconvolution process, which preserves the high resolution of the hand and avoids false recognition of the background. Moreover, the heatmap regression accuracy is also improved. Extensive experiments are conducted on two publicly available datasets, demonstrating our method's superior performance compared to other state-of-the-art frameworks. The main contributions are summarized below:
A novel, lightweight encoder-decoder network, entitled “FastHand”, is proposed for hand pose estimation based on heatmap regression. The proposed framework is able to perform on a mobile robot in real-time.
A 2D hand landmark dataset is generated based on the Youtube3D Hands image-sequence.
Extensive experiments are performed to demonstrate the effectiveness of the proposed method.
The rest of the paper is organized as follows. Section II reviews the related work on the hand pose estimation task. In Section III, the proposed method is described, while Section IV evaluates the system's performance and reports the comparative results. Finally, the paper concludes in Section V.
II Related Work
II-A Coordinate Regression-based Approaches
The authors in [44] proposed the first large-scale, multi-view, and 3D annotated real dataset, on which a Convolutional Neural Network (CNN) was trained. Subsequently, this framework could predict hand poses from the incoming RGB images with a certain generalization ability. NSRM [6] adopted a cascaded multi-task architecture aiming to jointly learn the hand structure and its landmarks' representation. A novel CNN with a “self-attention” module was proposed in [34]. This network had only 1.9 million parameters, making it suitable for deployment on embedded systems. “MediaPipe Hands” [41] first determines the position of the hands with a palm detector and then inputs the bounding-box containing the hands to the landmark model, which directly regresses 21 2.5D hand landmarks.
II-B Heatmap Regression-based Approaches
Methods based on heatmap regression achieve higher final accuracy than those based on coordinate regression. In [43], a deep network was proposed to estimate the 3D hand pose from RGB images. This framework detected the hand's location and then cropped the selected region. Subsequently, an encoder-decoder network extracted the 2D score maps that define the landmarks. Finally, 3D hand poses were estimated from these points by another network trained a priori on 3D data. A multi-view training method [35] used Convolutional Pose Machines (CPM) [40] as a detector and achieved similar performance to RGB-D systems. In [38], the network operates in two stages: hand mask prediction takes place first, and pose prediction follows. For a given RGB image, the hand bounding-box and landmarks were obtained simultaneously through a single forward pass of an encoder-decoder network in [39]. Afterwards, the hand region was used as feedback to determine whether a cycle detection was performed. Monocular 3D hand tracking performed well even in challenging occlusion scenarios through the synthetic generation of the training data [28]. Cai et al. [4] adopted a weakly-supervised pipeline trained on RGB-D images and predicted 3D joints from the incoming color frames. Utilizing a CNN for depth and heatmap estimation, 2.5D poses are obtained in [17]. The Adaptive Graphical Model Network (AGMN) [21] integrated a CNN with a graphical model for accurate 2D estimation. Similarly, a Rotation-invariant Mixed Graph Model Network (R-MGMN) performed 2D gesture estimation through a monocular RGB camera [22]. The authors in [27] proposed the InterHand2.6M dataset and InterNet for cases where two hands are used. Boukhayma et al. [2] introduced the first end-to-end deep learning method, which was formed by an encoder-decoder structure placed in series. 2D landmark information was utilized to facilitate the learning of a 3D hand pose. A graph CNN-based method was proposed in [12] to estimate the 3D hand joints and reconstruct the 3D mesh of the hand surface.
III Proposed Method
An overview of the proposed pipeline is outlined in Fig. 2. The input image is fed into the hand detection network to get the corresponding bounding box. We obtain a stabilized box through the information provided by the previous boxes. Afterwards, this ROI is fed into the landmark localization network, which computes the heatmaps. Finally, the 2D positions of the hand landmarks are extracted.
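Because the landmark network operates on a cropped ROI, its outputs eventually have to be expressed in the coordinate frame of the full image. The following is a minimal sketch of that mapping, assuming a square heatmap of side `heatmap_size` and an axis-aligned box given as `(x_min, y_min, x_max, y_max)`. The function name and argument layout are illustrative and are not taken from the FastHand implementation.

```python
def heatmap_to_image(points, box, heatmap_size):
    """Map (x, y) points from heatmap pixels to full-image pixels.

    points: iterable of (x, y) positions inside the heatmap.
    box: (x_min, y_min, x_max, y_max) of the ROI in the full image.
    heatmap_size: side length of the (assumed square) heatmap.
    """
    x_min, y_min, x_max, y_max = box
    scale_x = (x_max - x_min) / heatmap_size
    scale_y = (y_max - y_min) / heatmap_size
    return [(x_min + x * scale_x, y_min + y * scale_y) for x, y in points]


# Hypothetical values: a 128 x 128 ROI and a 64 x 64 heatmap.
roi_box = (100, 50, 228, 178)
image_points = heatmap_to_image([(0, 0), (32, 16)], roi_box, heatmap_size=64)
```

With these example values each heatmap pixel spans two image pixels, so the heatmap origin maps to the top-left corner of the ROI.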
III-A Hand Detection and Stabilization
III-A1 Hand Detection
Top-down approaches first detect the ROI, i.e., the hand, in the visual sensory information, and then localize its landmarks in the chosen area. In this way, the computational complexity is reduced. We utilize the MobileNet-SSD framework to detect human hands, which is based on MobileNetV2 [33] as the network's backbone and the Single Shot multi-box Detector (SSD) [25] for the detection.
III-A2 Stabilization of the Bounding-Box
As we adopt the top-down approach, jitter of the detection box is inevitable and affects the landmarks' localization. Aiming for more accurate predictions, we added a fast and straightforward method to stabilize the hand bounding-box. Since recently perceived frames present a higher correlation with the current box, we perform a weighted average with exponentially decreasing weights. This way, we exploit the correlation between the current box and the previous ones, as the weight is higher for neighbouring images. The current bounding-box coordinates are calculated as:
$$B = \frac{\sum_{i=0}^{n-1} B_{i}\, e^{-i}}{\sum_{i=0}^{n-1} e^{-i}} \qquad (1)$$
where $i$ represents the frame's index counted backwards from the current one (e.g., $i = 0$ denotes the current frame, while $i = 1$ is the previous one) and $n$ corresponds to the number of examined frames. This value was set heuristically to 6 ($n = 6$). $B_{i}$ are the coordinates of the bounding-box in frame $i$. Finally, the last term of Eqn. 1, i.e., $e^{-i}$, is the exponentially decreasing weight.
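The stabilization step can be illustrated with a short sketch. It assumes that the weight of the frame lying `i` steps in the past is `exp(-i)` and that the weights are normalized to sum to one. Both choices are assumptions made for this example, as are the function name and the `(x_min, y_min, x_max, y_max)` box layout.

```python
import math
from collections import deque


def stabilize_box(history, n=6):
    """Exponentially weighted average of the most recent bounding boxes.

    history: sequence of (x_min, y_min, x_max, y_max) tuples, newest last.
    n: number of frames to examine (6 in the text).
    """
    recent = list(history)[-n:][::-1]  # index 0 is the current frame
    weights = [math.exp(-i) for i in range(len(recent))]  # assumed decay
    total = sum(weights)
    return tuple(
        sum(w * box[c] for w, box in zip(weights, recent)) / total
        for c in range(4)
    )


# Usage: keep the last six raw detections and smooth on every frame.
boxes = deque(maxlen=6)
for raw_box in [(10, 10, 50, 50), (12, 11, 52, 51), (30, 10, 70, 50)]:
    boxes.append(raw_box)
    smoothed = stabilize_box(boxes)
```

A bounded `deque` keeps the cost per frame constant, which fits the real-time setting described in the text.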
III-B Landmark Localization
For the landmarks’ detection, we design a heatmap regression method following the encoder-decoder architecture outlined in Fig. 3. At first, we resize the image obtained from the previous steps to , and then we feed it to the proposed network aiming to extract high-level information. The feature map resolution is restored to through the decoder to generate the heatmaps, which are processed to extract the 2D coordinates of the landmarks.
III-B1 The Encoder Network
The encoder network utilizes convolution, skip connections [16], and depthwise separable convolutions [7], since we aim for a lightweight solution. The network is deep yet preserves a small number of parameters while learning more high-level information. Three parts define its structure, namely the Low, the Middle, and the High (Fig. 3 (a)). The network structure is the same as that of “MediaPipe Hands” [41]. More specifically, low-level features are extracted via the first part, which is composed of repeated residual blocks () and a downsampling operation. In the Middle part, the utilization of parameters is improved through an encoder-decoder network and skip connections. In the encoder, a downsampling operation is utilized to reduce the resolution, with residual blocks repeated four times, as illustrated in Fig. 3 (b). Then, in the decoder part, the resolution is increased through the upsampling module, which consists of residual blocks, a convolution, and a resize operation, as shown in Fig. 3 (c). Finally, in the High part, more high-level information is extracted while the resolution is reduced by the downsampling module, as shown in Fig. 3 (b).
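To see why depthwise separable convolutions keep the parameter count low, the weights of a standard convolution can be compared with those of its separable counterpart. The sketch below is plain arithmetic. The 3 x 3 kernel and the 128 input and output channels are illustrative values, not figures from the FastHand architecture.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """One k x k depthwise filter per input channel plus a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


# Hypothetical layer: 3 x 3 kernel, 128 input and 128 output channels.
standard = conv_params(3, 128, 128)                  # 147456 weights
separable = depthwise_separable_params(3, 128, 128)  # 1152 + 16384 = 17536 weights
ratio = standard / separable                         # roughly 8.4x fewer weights
```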
III-B2 The Decoder Network
The decoder network aims to restore the resolution of the feature map. Since the human hand presents a different structure from the face or other human parts, exhibiting more degrees of freedom, we increase the feature map resolution through a deconvolution () that preserves spatial position information. This technique offers a better modelling capability than direct coordinate regression and can provide improved accuracy. When heatmaps are used as ground truth to predict the landmarks, they must have high resolution. In our method, the encoder's output is . As a result, we use the decoder to increase the resolution to . This way, we manage to retain more position-related information. Three layers of deconvolutions are used to restore the feature map size. The proposed pipeline outputs heatmaps. Finally, the 2D coordinates of the 21 hand landmarks are derived by extracting the peak point of each channel.
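The final step, reading one landmark from each heatmap channel, can be sketched with NumPy. The example assumes a 64 x 64 heatmap per landmark and synthetic Gaussian targets with a standard deviation of 2 pixels. These values, like the function names, are chosen for illustration only and are not taken from the paper.

```python
import numpy as np


def gaussian_heatmap(size, center, sigma=2.0):
    """Render one size x size target with a Gaussian bump at center = (x, y)."""
    ys, xs = np.mgrid[0:size, 0:size]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def decode_heatmaps(heatmaps):
    """Return the (x, y) peak of every channel of an (H, W, K) heatmap stack."""
    h, w, k = heatmaps.shape
    flat_peaks = heatmaps.reshape(h * w, k).argmax(axis=0)
    ys, xs = np.unravel_index(flat_peaks, (h, w))
    return np.stack([xs, ys], axis=1)


SIZE = 64  # assumed heatmap side length
centers = [(3 * k, 60 - 2 * k) for k in range(21)]  # synthetic landmark positions
heatmaps = np.stack([gaussian_heatmap(SIZE, c) for c in centers], axis=-1)
landmarks = decode_heatmaps(heatmaps)  # array of shape (21, 2)
```

Taking the integer argmax limits the precision to one heatmap pixel, which is one reason a high heatmap resolution matters for this kind of decoding.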
In this section, extensive experiments are conducted to demonstrate the effectiveness of FastHand. Due to the lack of a suitable 2D hand pose dataset, we generate a large-scale dataset using image sequences from YouTube3D Hands [23]. Two widely used datasets, STB [42] and RHD [43], are selected for the method's evaluation. Subsequently, we introduce the experimental settings, which include the training parameters and evaluation metrics. Finally, we provide quantitative and qualitative results to demonstrate the overall performance of our method.
IV-A 2D Hand Pose Dataset Generation
The existing 2D hand pose datasets have several limitations. Datasets such as STB [42], MHP [14], and HO3D [15] only cover limited scenes and hand movements. On the other hand, synthetic datasets lack realism. It is difficult to collect and annotate 2D hand poses because the hand exhibits many degrees of freedom, self-occlusion, and mutual occlusion.
We generate a 2D hand pose dataset based on the YouTube3D Hands [23] dataset, which is a large-scale collection of image sequences showing hand actions. It provides 109 videos in total, 102 of which form the training set, while the rest form the test set. As this dataset offers the 3D vertex coordinates of hands, we need to extract the 2D information of the hand landmarks. By averaging the 10 vertices around each joint of the hand skeleton, we get the 3D coordinates of the 21 hand landmarks. Subsequently, we project them onto the 2D plane to obtain the 2D points. The generated 2D hand pose dataset is referred to as YouTube2D Hands, and the generation method is publicly available (https://github.com/AnshanTJU/Youtube3DHands-2D), aiming to facilitate future studies. As a final note, the images are resized to  and augmented tenfold using random scaling and cropping. Besides, we use the GANeratedHands dataset [28] for the network's training, since real-world images cannot cover all gestures and the annotation in real data is always inaccurate. A brief description of each dataset is shown in Table I.
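The landmark extraction described in this subsection can be sketched as follows. The sketch assumes that the indices of the 10 vertices surrounding each joint are already known and that the projection simply drops the depth coordinate. The released generation code may use a different vertex selection and camera model, so the names, the synthetic mesh, and the orthographic projection here are all hypothetical.

```python
import numpy as np


def landmarks_from_vertices(vertices, neighbor_ids):
    """Average the vertices assigned to each joint.

    vertices: (V, 3) array of mesh vertex coordinates.
    neighbor_ids: (21, 10) integer array, 10 vertex indices per landmark.
    Returns a (21, 3) array of 3D landmark positions.
    """
    return vertices[neighbor_ids].mean(axis=1)


def project_orthographic(points_3d):
    """Assumed projection: keep x and y, discard depth."""
    return points_3d[:, :2]


# Synthetic mesh with 210 vertices so that the result is easy to check by hand.
vertices = np.arange(210, dtype=float)[:, None] * np.array([1.0, 2.0, 3.0])
neighbor_ids = np.arange(210).reshape(21, 10)
landmarks_3d = landmarks_from_vertices(vertices, neighbor_ids)
landmarks_2d = project_orthographic(landmarks_3d)
```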
|Dataset|Usage|Description|Image resolution|# Joints|# Images|
|Test set of STB [42]|Test|Real-world| |21|6000|
|Test set of RHD [43]|Test|Synthetic| |21|2727|
|Dataset|Metric|Using YouTube2D Hands|Using GANeratedHands [28]|Using Both Datasets|
↑ means the higher the better, while ↓ means the lower the better.
|Dataset|Metric|SRHandNet [39]|NSRM Hand [6]|InterHand [27] trained on STB|InterHand [27] trained on RHD|Our Method|
↑ means the higher the better, while ↓ means the lower the better.
IV-B Experimental Settings
IV-B1 Training Process
Training took place on four NVIDIA Tesla P40 GPUs. The entire network was trained for a total of 50 epochs, while we decreased the learning rate exponentially after the first three. The initial learning rate was set to  and the batch size to 64. The Adam optimizer [20] was utilized as well. Finally, the network's implementation was based on TensorFlow [1].
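The learning-rate policy can be illustrated with a small schedule function. The text specifies neither the decay factor nor the exact onset of the decay, so `decay_rate=0.9`, the base rate of `1e-3`, and the choice to apply the first decay step at epoch index 3 are placeholders chosen for this sketch only.

```python
def learning_rate(epoch, base_lr, decay_rate=0.9, hold_epochs=3):
    """Constant rate for the first `hold_epochs` epochs, exponential decay afterwards."""
    if epoch < hold_epochs:
        return base_lr
    return base_lr * decay_rate ** (epoch - hold_epochs + 1)


# Hypothetical base rate; the value used in the paper is not given in the text.
schedule = [learning_rate(epoch, base_lr=1e-3) for epoch in range(50)]
```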
IV-B2 Evaluation Metrics
The Sum of Squares Error (SSE), the End-Point Error (EPE), and the Probability of Correct Keypoint (PCK) within a normalized distance threshold are used as evaluation metrics:
where the term is the ground truth of landmark and is the system’s predicted coordinate. represents the landmark indexes and represents the index of hand sample. denotes the number of samples in the dataset, while and are the original images’ width and height, respectively. In Eq. 4, is the indicator function and is the threshold. If distance between the predicted landmark and the ground truth is less than , the indicator function is set to 1, otherwise to 0. represents the metric of landmark at the ratio of .
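For orientation, the three metrics can be sketched in NumPy using their commonly used forms. The sketch assumes predictions and ground truth stored as arrays of shape `(N, 21, 2)` and assumes that PCK normalizes the coordinate error by the image width and height. The exact normalization used in the equations of this paper may differ, so the functions below should be read as an illustration rather than the reference definition.

```python
import numpy as np


def sse(pred, gt):
    """Sum of squared coordinate errors over all samples and landmarks."""
    return float(((pred - gt) ** 2).sum())


def epe(pred, gt):
    """Mean Euclidean distance between predicted and true landmarks, in pixels."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def pck(pred, gt, sigma, width, height):
    """Fraction of landmarks whose size-normalized error is below sigma (assumed form)."""
    scale = np.array([width, height], dtype=float)
    dist = np.linalg.norm((pred - gt) / scale, axis=-1)
    return float((dist < sigma).mean())


# Synthetic check: every prediction is off by (3, 4) pixels, i.e. 5 pixels.
gt = np.zeros((2, 21, 2))
pred = gt + np.array([3.0, 4.0])
results = (sse(pred, gt), epe(pred, gt), pck(pred, gt, 0.1, 100, 100))
```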
IV-C The Impact of Using Different Training Sets
At first, we study the impact of using different training datasets for the proposed network. As shown in Table II, the system's performance improves when FastHand is also trained on YouTube2D Hands. When training our model on both datasets, we achieve better performance on STB and RHD. The results demonstrate that the training set plays an important role and that our generated dataset contributes to the improvement of the framework's accuracy.
IV-D Comparison With The State-of-the-art Techniques
We compare the proposed pipeline against three state-of-the-art methods, namely SRHandNet [39], NSRM Hand [6], and InterHand [27]. SRHandNet is trained on OneHand10K, NSRM Hand on OneHand10K and CMU Panoptic Hand [18], while InterHand is trained on the STB [42] and the RHD [43] datasets. As no open-source implementations were available for these networks, we utilize their trained models for comparison. As shown in Table III, InterHand [27] achieves the expected performance when evaluated on STB and RHD. This demonstrates the significance of the training set for the system's overall performance, as matching data distributions lead to accurate prediction of hand landmarks. However, FastHand follows InterHand even though it is not trained on STB and RHD. This exhibits our method's generalization ability compared with the other approaches.
In Tables IV and V, we compare the execution times and the networks' sizes, respectively. As one can observe, our model is the smallest among the compared ones, with a size of 13.0 MB, which is less than 1/3 of the size of SRHandNet. Furthermore, we list the inference run-time of the different methods, showing that FastHand executes at 25 FPS on an NVIDIA Jetson TX2 GPU and at 31 FPS on the NVIDIA GeForce 940MX GPU of a laptop, outperforming every other method.
IV-E Qualitative Evaluation
We show the visualization results of the proposed pipeline in Fig. 4. Different perspectives of the hand pose are shown in each column. Our framework accurately detects the landmarks of open hands, such as “Num 5”, while it performs favourably in cases where occlusions occur, such as the fingertips in the “thumbs-up” and “claw” poses. The high-accuracy hand landmark localization is due to our decoder design, which extracts high-resolution feature maps. In our experiments, we found that our method is able to recognize the hand gestures of numbers to and other basic ones, including self-occluding poses. In Fig. 5 and Fig. 6, we show the results obtained by FastHand and the other methods on the STB and RHD datasets. As can be seen, our network predicts accurate coordinates, outperforming the rest of the approaches.
In this paper, a fast 2D hand pose estimation pipeline, entitled “FastHand”, is proposed. The core of our framework is a lightweight encoder-decoder network. Through hand detection and stabilization, the ROI is cropped and fed into our network for landmark localization. The proposed network is trained on our large-scale 2D hand pose dataset and a synthetic one, and it is evaluated on two publicly available datasets. In comparison to several state-of-the-art techniques, the proposed approach shows improved performance while executing in real-time (over 25 frames per second) on a low-cost GeForce 940MX GPU and an NVIDIA Jetson TX2 GPU.
[1] TensorFlow: a system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pp. 265–283. Cited by: §IV-B1.
[2] (2019) 3d hand shape and pose from images in the wild. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 10843–10852. Cited by: §II-B.
[3] (2004) Face tracking and hand gesture recognition for human-robot interaction. In Proc. IEEE Int. Conf. Robotics and Automation (ICRA), Vol. 2, pp. 1901–1906. Cited by: §I.
[4] (2018) Weakly-supervised 3d hand pose estimation from monocular rgb images. In Proc. Eur. Conf. Computer Vision (ECCV), pp. 666–682. Cited by: §I, §II-B.
[5] (2019) Improved optical flow for gesture-based human-robot interaction. In Proc. IEEE Int. Conf. Robotics and Automation (ICRA), pp. 7983–7989. Cited by: §I.
[6] (2020) Nonparametric structure regularization machine for 2d hand pose estimation. In IEEE Winter Conf. Applications of Computer Vision (WACV), pp. 381–390. Cited by: §I, §II-A, §IV-D, TABLE III, TABLE IV, TABLE V.
[7] (2017) Xception: deep learning with depthwise separable convolutions. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1251–1258. Cited by: §I, §III-B1.
[8] (2015) A kinect-based gesture recognition approach for a natural human robot interface. Int. J. Advanced Robotic Systems 12 (3), pp. 22. Cited by: §I.
[9] (2018) Deep learning for hand gesture recognition on skeletal data. In 13th IEEE Int. Conf. on Automatic Face & Gesture Recognition (FG), pp. 106–113. Cited by: §I.
[10] (2016) A human-robot interaction interface for mobile and stationary robots based on real-time 3d human body and hand-finger pose estimation. In IEEE 21st Int. Conf. Emerging Technologies and Factory Automation (ETFA), pp. 1–6. Cited by: §I.
[11] (2018) Robust 3d hand pose estimation from single depth images using multi-view cnns. IEEE Trans. Image Processing 27 (9), pp. 4422–4436. Cited by: §I.
[12] (2019) 3d hand shape and pose estimation from a single rgb image. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 10833–10842. Cited by: §I, §II-B.
[13] (1999) The role of gesture in communication and thinking. Trends in cognitive sciences 3 (11), pp. 419–429. Cited by: §I.
[14] (2019) Large-scale multiview 3d hand pose dataset. Image and Vision Computing 81, pp. 25–33. Cited by: §IV-A.
[15] (2019) HO-3D: a multi-user, multi-object dataset for joint 3d hand-object pose estimation. arXiv preprint arXiv:1907.01481. Cited by: §IV-A.
[16] (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §I, §III-B1.
[17] (2018) Hand pose estimation via latent 2.5 d heatmap regression. In Proc. Eur. Conf. Computer Vision (ECCV), pp. 118–134. Cited by: §I, §II-B.
[18] (2017) Panoptic studio: a massively multiview system for social interaction capture. IEEE Trans. Pattern Analysis and Machine Intelligence 41 (1), pp. 190–204. Cited by: §IV-D.
[19] An active learning paradigm for online audio-visual emotion recognition. IEEE Trans. Affective Computing. Cited by: §I.
[20] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-B1.
[21] (2019) Adaptive graphical model network for 2d handpose estimation. arXiv preprint arXiv:1909.08205. Cited by: §II-B.
[22] (2020) Rotation-invariant mixed graphical model network for 2d hand pose estimation. In IEEE Winter Conf. Applications of Computer Vision (WACV), pp. 1546–1555. Cited by: §II-B.
[23] (2020) Weakly-supervised mesh-convolutional hand reconstruction in the wild. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 4990–5000. Cited by: §IV-A, §IV.
[24] (2020) Decoupled representation learning for skeleton-based gesture recognition. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 5751–5760. Cited by: §I.
[25] (2016) SSD: single shot multibox detector. In Proc. Eur. Conf. Computer Vision (ECCV), pp. 21–37. Cited by: Fig. 2, §III-A1.
[26] (2018) Towards real-time physical human-robot interaction using skeleton information and hand gestures. In Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems (IROS), pp. 1–6. Cited by: §I.
[27] (2020) InterHand2.6M: a dataset and baseline for 3d interacting hand pose estimation from a single rgb image. arXiv preprint arXiv:2008.09309. Cited by: §II-B, §IV-D, TABLE III, TABLE IV, TABLE V.
[28] (2018) Ganerated hands for real-time 3d hand tracking from monocular rgb. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 49–59. Cited by: §I, §II-B, §IV-A, TABLE I, TABLE II.
[29] A neural network based on spd manifold learning for skeleton-based hand gesture recognition. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 12036–12045. Cited by: §I.
[30] Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition. Pattern Recognition 76, pp. 80–94. Cited by: §I.
[31] (2011) Efficient model-based 3d tracking of hand articulations using kinect. In The British Machine Vision Conference (BMVC), Vol. 1, pp. 3. Cited by: §I.
[32] (2013) Robust part-based hand gesture recognition using kinect sensor. IEEE Trans. Multimedia 15 (5), pp. 1110–1120. Cited by: §I.
[33] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520. Cited by: Fig. 2, §III-A1.
[34] (2020) Attention! A Lightweight 2D Hand Pose Estimation Approach. IEEE Sensors Journal. Cited by: §I, §II-A.
[35] (2017) Hand keypoint detection in single images using multiview bootstrapping. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1145–1153. Cited by: §I, §II-B.
[36] (2015) Robust articulated-icp for real-time hand tracking. In Computer Graphics Forum, Vol. 34, pp. 101–114. Cited by: §I.
[37] (2019) Self-supervised 3d hand pose estimation through training by fitting. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 10853–10862. Cited by: §I.
[38] (2018) Mask-pose cascaded cnn for 2d hand pose estimation from single color image. IEEE Trans. Circuits and Systems for Video Technology 29 (11), pp. 3258–3268. Cited by: §II-B.
[39] (2019) SRHandNet: real-time 2d hand pose estimation with simultaneous region localization. IEEE Trans. Image Processing 29, pp. 2977–2986. Cited by: §II-B, §IV-D, TABLE III, TABLE IV, TABLE V.
[40] (2016) Convolutional pose machines. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 4724–4732. Cited by: §II-B.
[41] (2020) MediaPipe hands: on-device real-time hand tracking. arXiv preprint arXiv:2006.10214. Cited by: §II-A, §III-B1.
[42] (2016) 3d hand pose tracking and estimation using stereo matching. arXiv preprint arXiv:1610.07214. Cited by: Fig. 5, §IV-A, §IV-D, TABLE I, TABLE II, TABLE III, §IV.
[43] Learning to estimate 3d hand pose from single rgb images. In Proc. IEEE Int. Conf. Computer Vision (ICCV), pp. 4903–4911. Cited by: §I, §II-B, Fig. 6, §IV-D, TABLE I, TABLE II, TABLE III, §IV.
[44] (2019) FreiHAND: a dataset for markerless capture of hand pose and shape from single rgb images. In Proc. IEEE Int. Conf. Computer Vision (ICCV), pp. 813–822. Cited by: §I, §II-A.