LE-HGR: A Lightweight and Efficient RGB-based Online Gesture Recognition Network for Embedded AR Devices

by Hongwei Xie, et al.

Online hand gesture recognition (HGR) techniques are essential in augmented reality (AR) applications for enabling natural human-to-computer interaction and communication. In recent years, the consumer market for low-cost AR devices has been growing rapidly, while the technology in this domain remains immature. These devices typically have low prices, limited memory, and resource-constrained computational units, which makes online HGR a challenging problem. To tackle this problem, we propose a lightweight and computationally efficient HGR framework, namely LE-HGR, to enable real-time gesture recognition on embedded devices with low computing power. We also show that the proposed method is accurate and robust, reaching high-end performance in a variety of complicated interaction environments. To achieve our goal, we first propose a cascaded multi-task convolutional neural network (CNN) to simultaneously predict probabilities of hand detection and regress hand keypoint locations online. We show that, with the proposed cascaded architecture design, false-positive estimates can be largely eliminated. Additionally, an associated mapping approach is introduced to track the hand trace via the predicted locations, which addresses the interference of multiple hands. Subsequently, we propose a trace sequence neural network (TraceSeqNN) to recognize the hand gesture by exploiting the motion features of the tracked trace. Finally, we provide a variety of experimental results to show that the proposed framework achieves state-of-the-art accuracy with significantly reduced computational cost, which are the key properties for enabling real-time applications on low-cost commercial devices such as mobile devices and AR/VR headsets.






1 Related Works

1.1 Visual Recognition Methods for Hand Detection

One of the best-known families of visual recognition algorithms is R-CNN [girshick2015fast, ren2015faster], which generates potential bounding boxes via region proposal methods and runs a classifier on these proposed boxes. However, these methods require a large amount of computational resources, making them infeasible for low-cost embedded devices.

Unlike region proposal-based techniques, the single shot multi-box detector (SSD) [DBLP:journals/corr/LiuAESR15] produces predictions on multi-scale feature maps and achieves competitive accuracy even with relatively low-resolution input. To enable convolutional neural networks to run in real time on low-cost devices, Howard et al. [howard2017mobilenets] proposed the MobileNet network, which is based on depth-wise separable convolutions to reduce the number of parameters.
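To make the parameter saving concrete, the following minimal sketch counts the weights of a standard convolution and of a depth-wise separable one (a depth-wise k x k convolution followed by a 1 x 1 point-wise convolution). The function names and the layer sizes in the usage lines are illustrative assumptions, not values from the paper, and biases are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depth-wise kernel x kernel convolution plus a 1 x 1 point-wise one."""
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 3 x 3 kernel, 32 input channels, 64 output channels.
    std = standard_conv_params(3, 32, 64)
    sep = separable_conv_params(3, 32, 64)
    print(std, sep, round(sep / std, 3))
```

For this hypothetical layer the separable form needs 2336 weights instead of 18432, a ratio of roughly 1/c_out + 1/k^2.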

1.2 Gesture Recognition using Spatial-temporal Features

1.2.1 HGR via RGB-D based Approaches

RGB-D cameras have a natural advantage in action recognition due to their capability of directly capturing dense 3D point cloud data. In recent years, a variety of methods have been successfully applied to 3D skeleton-based action recognition problems [zhang2019view, zhao2018skeleton, liu2018skeleton, xie2018memory, ohn2014hand, hou2018spatial]. De Smedt et al. [de20173d, zhao2018skeleton] encoded the temporal pyramid information using 3D skeleton-based geometric features and used a linear support vector machine (SVM) for classification. Chen et al. [chen2017motion] proposed a motion feature augmented recurrent neural network (RNN) for skeleton-based HGR.

1.2.2 Gesture Recognition using Video Sequences

A large number of 3D-CNN-based methods [wang2018human, miao2017multimodal, tran2015learning, gupta2016online] have been developed to improve performance on various video analysis tasks. Among them, ConvLSTM [xingjian2015convolutional, zhang2017learning, zhu2017multimodal, wang2017large] combined the 3D-CNN and LSTM to effectively learn different levels of spatial-temporal characteristics. CLDNN [sainath2015convolutional] analyzed the effect of adding CNN layers to LSTM, and LRCN [donahue2015long] extended the convolutional LSTM to visual recognition problems. In addition, multiple-stream methods have been proposed for handling multi-modal data [nishida2015multimodal, simonyan2015two]; they separately learn the spatial features from video frames and the temporal features from the dense optical flow for action recognition.

2 Methodology

2.1 Framework Overview

In this section, we illustrate the overall pipeline of our approach. As shown in the overview figure of LE-HGR, our framework mainly consists of four stages: detecting hand candidates from input images, refining detection results and regressing hand keypoints using a multi-task network, trace mapping for addressing multi-hand ambiguities, and gesture recognition from the hand-skeleton sequence. In the first stage, the hand bounding boxes are detected via an ultra-simplified SSD [liu2016ssd] network. Since this lightweight detector inevitably produces false-positive detections, we propose a cascaded multi-task network to refine those results and regress the hand keypoints in parallel. To effectively handle the situation in which multiple hands appear simultaneously in input images, we maintain the temporal traces by associating the predicted hands with the existing hand motion trails. Finally, hand keypoints in the temporal trace are stacked as the sequential input of the TraceSeqNN to recognize the gesture.

2.2 Joint Hand Detection and Keypoints Regression using the Multi-task Cascaded Network

In this section, we describe the architecture and training objective of our multi-task network, which is also shown in Figure 1.

2.2.1 Online Hand Candidate Detector

Detector Network Architecture. In this stage, SSD [DBLP:journals/corr/LiuAESR15] is chosen as our candidate detector to balance speed and accuracy. MobileNet [howard2017mobilenets] is used as the backbone CNN and the width multiplier (WM) is set to 0.25 [howard2017mobilenets] to improve the inference speed. The multi-scale pyramid feature maps are produced by max pooling operations, similar to PPN [DBLP:journals/corr/abs-1807-03284], to reduce parameters.

Training Loss of the Candidate Detector. The overall loss function is a weighted sum of the confidence loss and the regression loss. The regression loss of the bounding boxes follows the concept of [girshick2015fast, DBLP:journals/corr/LiuAESR15], and the confidence loss is the cross-entropy loss between the ground-truth label and the predicted candidate probability.
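As a rough illustration of such a weighted sum, the sketch below combines a binary cross-entropy confidence term with a smooth L1 box regression term. Smooth L1 is the regression loss used in the cited Fast R-CNN and SSD works and is assumed here; the weight name alpha, its default of 1.0, and the choice to apply the regression term only to positive candidates are likewise assumptions rather than details confirmed by the text.

```python
import math


def cross_entropy(label: int, prob: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy between a 0/1 label and a predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def smooth_l1(pred, target) -> float:
    """Smooth L1 loss summed over box coordinates (assumed regression loss)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def detector_loss(label, prob, pred_box, gt_box, alpha: float = 1.0) -> float:
    """Weighted sum of confidence and regression losses; alpha is a hypothetical weight."""
    conf = cross_entropy(label, prob)
    # Assumption: only positive candidates contribute a box regression term.
    reg = smooth_l1(pred_box, gt_box) if label == 1 else 0.0
    return conf + alpha * reg


if __name__ == "__main__":
    print(detector_loss(1, 0.9, [0.1, 0.1, 0.9, 0.9], [0.0, 0.0, 1.0, 1.0]))
```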

2.2.2 Hand Detection and Keypoints Regression using the Multi-task CNN

Multi-task Network Architecture. A multi-task CNN is used as the validation stage to reject erroneous candidates and regress the hand keypoints at the same time. Two branches of fully connected (FC) layers are added to the end of the network for hand confidence prediction and keypoint regression, respectively.

Training Loss of the Multi-task Network. The complete loss function of the multi-task network combines the cross-entropy loss over the binary hand classification, computed from the predicted hand probability, with the Euclidean loss over the coordinates of the hand keypoints.

2.2.3 Hand Trace Mapping via the Predicted Locations

When an HGR system is used in complicated environments, ambiguity caused by multiple hands is the norm rather than the exception. While other methods [gupta2016online] often ignore this problem, we introduce hand trace mapping to solve it.

Hand Trace Mapping. The hand mapping approach matches the predicted hand of the current frame with the maintained traces using the current bounding box and keypoints. We adopt a match loss to measure the similarity between the current hand and each existing trace, which combines the intersection over union (IoU) and keypoint locations. The overall loss function of the matching phase combines three terms: the IoU loss and the region area loss of the hand bounding box indicate the regional similarity, and the Euclidean loss measures the similarity of keypoint positions.
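The matching step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the three terms are summed with hypothetical weights (w_iou, w_area, w_kp), the area term is a relative area difference, the keypoint distance is normalised by the trace box size, and a hand whose best cost exceeds a hypothetical max_cost threshold is treated as the start of a new trace.

```python
import math


def box_area(b):
    """Area of a box given as (x1, y1, x2, y2)."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a, b):
    """Intersection over union of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def area_term(a, b):
    """Relative difference of box areas, in [0, 1] (assumed form of the area loss)."""
    big = max(box_area(a), box_area(b))
    return abs(box_area(a) - box_area(b)) / big if big > 0 else 0.0


def keypoint_term(kp_a, kp_b, scale):
    """Mean Euclidean distance between corresponding keypoints, divided by scale."""
    dists = [math.dist(p, q) for p, q in zip(kp_a, kp_b)]
    return sum(dists) / (len(dists) * scale)


def match_cost(hand, trace, w_iou=1.0, w_area=1.0, w_kp=1.0):
    """Cost between a detected hand and the last state of a trace: (box, keypoints)."""
    (box_h, kp_h), (box_t, kp_t) = hand, trace
    scale = max(box_t[2] - box_t[0], box_t[3] - box_t[1], 1e-6)
    return (w_iou * (1.0 - iou(box_h, box_t))
            + w_area * area_term(box_h, box_t)
            + w_kp * keypoint_term(kp_h, kp_t, scale))


def assign(hand, traces, max_cost=1.5):
    """Index of the best-matching trace, or None if no trace is close enough."""
    best, best_cost = None, max_cost
    for i, trace in enumerate(traces):
        cost = match_cost(hand, trace)
        if cost < best_cost:
            best, best_cost = i, cost
    return best


if __name__ == "__main__":
    traces = [((10, 10, 20, 20), [(12, 12)]), ((0, 0, 4, 4), [(1, 1)])]
    print(assign(((0, 0, 4, 4), [(1, 1)]), traces))
```

With several hands in one frame, a greedy loop over detections (or a global assignment solver) would call `assign` per hand; the text does not say which strategy is used.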

Temporal Motion Features. Once a detected hand is associated with a trace, the temporal motion features of this tracked hand can be computed directly from its location vector. These features comprise the velocity of the tracked hand and the edge vectors of the hand skeleton, defined over the skeleton's edge set, which describe the hand shape information.
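A minimal sketch of these two feature groups is given below, assuming 2D keypoints, a velocity defined as the frame-to-frame difference of keypoint positions, and edge vectors defined as differences between connected keypoints. The edge list, the time step dt, and the dictionary layout are illustrative assumptions.

```python
def velocity(prev_kps, curr_kps, dt=1.0):
    """Per-keypoint displacement between two consecutive frames, divided by dt."""
    return [((cx - px) / dt, (cy - py) / dt)
            for (px, py), (cx, cy) in zip(prev_kps, curr_kps)]


def edge_vectors(kps, edges):
    """Vector from keypoint i to keypoint j for every skeleton edge (i, j)."""
    return [(kps[j][0] - kps[i][0], kps[j][1] - kps[i][1]) for i, j in edges]


def motion_features(trace, edges):
    """Velocity and shape features for every pair of consecutive frames in a trace."""
    feats = []
    for prev, curr in zip(trace, trace[1:]):
        feats.append({
            "velocity": velocity(prev, curr),
            "shape": edge_vectors(curr, edges),
        })
    return feats


if __name__ == "__main__":
    # Hypothetical trace with two keypoints per frame and one skeleton edge.
    trace = [[(0, 0), (1, 0)], [(1, 1), (2, 2)]]
    print(motion_features(trace, edges=[(0, 1)]))
```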

2.3 HGR using Temporal Motion Features

2.3.1 Online Recognition based on the TraceSeqNN

As depicted in Figure 2, the proposed TraceSeqNN primarily consists of a transformer, an LSTM block, and an FC block. Given the timestep, the sequential input is fed into the transformer, which reshapes the motion features for the LSTM layers [hochreiter1997long].

In temporal movements, the hand shape and the velocity of keypoints exhibit different characteristics. We use a velocity branch and a shape branch to learn the velocity features and hand shape features separately. The outputs of these two branches are then stacked and fed into the FC layers, and the probability of each gesture category is finally predicted via the softmax layer.

2.3.2 Sequence Sample Generation and Augmentation

Sample Generation Principle. Given the start and end timestamps of annotated segments in the captured video, we can generate the sample set for the training and test phases by clipping the original video with an objective timestep.


Table 1: The hand detection results. The top two rows show the impact of image input size. The proposed cascaded network is effective in improving the performance for low-resolution input. Columns: Model; Backbone; Our Dataset (Recall, Precision); Nvidia (Recall, Precision); (MTK8167 CPU).


Table 2: Evaluation of the regression errors on the public dataset. We conduct experiments on public datasets to compute the error metrics for keypoint regression. Columns: Models; Dataset; Resolution; Error (Mean, Median); FPS (MacPro i7 CPU). Each model, including our keypoint regression, is evaluated on Nvidia [gupta2016online] and on our dataset.


Table 3: Evaluation of the TraceSeqNN for HGR on our dataset. We compare our models against different state-of-the-art approaches on the same sequence dataset, which consists of left-waving (lwave), right-waving (rwave), and negative (neg) samples. Taking our temporal motion features as input, the HGR can run at 250 fps on the low-cost MediaTek MTK-8167S processor. Columns: Input; Recall (neg, lwave, rwave); Precision (neg, lwave, rwave); Accuracy; False Positive; (MTK8167 CPU). Baseline rows: TDNN [waibel1995phoneme] (three rows), LSTM [hochreiter1997long], CLDNN [sainath2015convolutional], LRCN [donahue2015long].

In order to distinguish positive and negative samples, we compute the labels of these samples according to the similarity between the clipped samples and their corresponding original segments. First, we represent the similarity by two ratios, the intersection over the sample and the intersection over its annotation, computed from the start and end timestamps of the clipped sequence in the original video.

Subsequently, the label of the clipped sample is computed from this similarity metric, the annotated category in the original video, and two thresholds of the sampling principle.
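One plausible reading of this labelling rule is sketched below. The two overlap ratios follow the description above; how they are combined with the thresholds is an assumption: a clip takes the annotated category when both ratios reach the upper threshold, becomes a negative sample when both stay at or below the lower threshold, and is otherwise discarded as ambiguous. The defaults 0.3 and 0.85 echo values quoted in the training details, but their exact roles are not recoverable from the text.

```python
def overlap_ratios(clip, annotation):
    """Return (intersection / clip length, intersection / annotation length).

    Both arguments are (start, end) timestamps in the original video.
    """
    (c_start, c_end), (a_start, a_end) = clip, annotation
    inter = max(0.0, min(c_end, a_end) - max(c_start, a_start))
    clip_len = c_end - c_start
    ann_len = a_end - a_start
    r_clip = inter / clip_len if clip_len > 0 else 0.0
    r_ann = inter / ann_len if ann_len > 0 else 0.0
    return r_clip, r_ann


def label_clip(clip, annotation, category, low=0.3, high=0.85, negative="neg"):
    """Label a clipped sample; the combination rule and thresholds are assumptions."""
    r_clip, r_ann = overlap_ratios(clip, annotation)
    if min(r_clip, r_ann) >= high:
        return category      # clip covers the gesture well: positive sample
    if max(r_clip, r_ann) <= low:
        return negative      # clip barely touches the gesture: negative sample
    return None              # ambiguous overlap: not used


if __name__ == "__main__":
    print(label_clip((0, 10), (0, 10), "lwave"))
    print(label_clip((0, 10), (20, 30), "lwave"))
    print(label_clip((0, 10), (5, 15), "lwave"))
```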

Data Augmentation for the Temporal Domain. To generate a large number of hand motion samples with different temporal extents from the limited annotated segments, we propose a novel data augmentation method to increase the diversity of temporal domains. First, the sequence samples are generated via a timestep set defined by the minimum timestep and a second parameter that is configured to a constant value.

Subsequently, using a nonlinear transformation, i.e., interpolation or down-sampling operations, the generated samples are re-sampled and deformed to the target timestep.
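The re-sampling step can be illustrated with a small sketch that stretches or compresses a clip to a fixed number of timesteps. Plain linear interpolation over the frame index is used here as a stand-in; the paper's exact transformation is not specified in the surviving text, and the function name and the list-of-vectors input layout are assumptions.

```python
def resample(sequence, target_len):
    """Linearly resample a list of equal-length feature vectors to target_len steps."""
    n = len(sequence)
    if n == 0 or target_len <= 0:
        raise ValueError("sequence must be non-empty and target_len positive")
    if n == 1 or target_len == 1:
        # Nothing to interpolate between: repeat the first frame.
        return [list(sequence[0]) for _ in range(target_len)]
    out = []
    for k in range(target_len):
        pos = k * (n - 1) / (target_len - 1)   # fractional index into the source clip
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(sequence[lo], sequence[hi])])
    return out


if __name__ == "__main__":
    # Up-sampling a 2-step clip and down-sampling a 5-step clip to 3 steps.
    print(resample([[0.0], [2.0]], 3))
    print(resample([[0.0], [1.0], [2.0], [3.0], [4.0]], 3))
```

Clips of several source lengths can be passed through `resample` with the same target length so that one annotated gesture yields samples at different apparent speeds.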


3 Experiments

3.1 Dataset

Table 4: Comparison with the state-of-the-art methods on the published Nvidia dataset. Proposed-reg uses the multi-task CNN to regress the keypoints, while proposed-cpm adopts the CPM-1Stage network to predict the keypoint locations. TraceSeqNN-SB uses the single branch to learn the motion features. Columns: Method; Modality; Accuracy. Rows: Spatial stream CNN [simonyan2015two]; C3D [molchanov2015hand]; R3DCNN [gupta2016online]; Proposed-reg, Data-Aug; Proposed-reg, Data-Aug; Proposed-cpm, Data-Aug.

3.1.1 Public Nvidia Gesture Dataset

Nvidia Corporation published a dataset of 25 gesture types intended for touchless interfaces in cars [gupta2016online]. The dataset consists of a total of 1050 training and 482 testing video sequences, covering both bright and dim artificial lighting. In order to validate the effectiveness of our proposed algorithm on the Nvidia dataset, we also add annotations for the keypoints and bounding boxes of this dataset for the training phase. All new annotations will be publicly available.

3.1.2 Our Dataset

In addition to testing on the public Nvidia dataset, we also built our own dataset through crowd-sourcing. This dataset consists of more than 150k RGB images captured with different RGB cameras under different environmental conditions (e.g., different backgrounds and lighting conditions), of which 120k are training images and 30k are testing images. In addition, we recorded 30000 video sequences using different RGB cameras at 30 fps.

3.2 Evaluation of Multi-task Cascaded Network for Hand Detection

The proposed cascaded multi-task network is adopted to predict the hand confidence and regress the keypoint locations. In this section, we conduct experiments to quantitatively evaluate our architecture design on a hand-held low-cost interactive device. The main processor on this device is the MediaTek MTK-8167S.

3.2.1 Training Details

To train the proposed network, we annotated 8867 ground-truth bounding boxes for the Nvidia dataset and 90000 ground-truth bounding boxes for our dataset. The candidate detector was trained with the annotated datasets. Subsequently, for the multi-task CNN, we generated positive samples with IoU over 0.5 and negative samples with IoU less than 0.3 via the candidate detector model. The hand keypoints of the positive samples are annotated only for the training phase of the multi-task CNN.

3.2.2 Evaluation of the Multi-task Cascaded Network

We compared the proposed network against the standard SSD to evaluate the accuracy on the previously mentioned datasets. For a fair comparison, we chose MobileNetV1 as the backbone of both our detector and the standard SSD. The results in Table 1 show that the detection results of SSD degrade drastically when using low-resolution input. However, the proposed network greatly improves both detection recall and precision with only negligible computational cost.

We also measured the error metrics on our dataset and Nvidia [gupta2016online] dataset to calculate the precision of regression. The results are shown in Table 2. Furthermore, we implemented a one-stage convolutional pose machine (CPM-1Stage) [wei2016convolutional] using the MobileNet as the backbone.

3.3 Evaluation of the TraceSeqNN and Data Augmentation

3.3.1 Training Details

For data augmentation of the training set, we set the objective timestep to be 13 according to the distribution of and . The minimum timestep is set to be 8 and is 5. Then, and are set to be 0.3 and is 0.85 for calculating labels of the clipped samples. During the training phase of TraceSeqNN, the initial learning rate is set to be 0.004. The number of LSTM hidden layers is set to be 64. We train all the models using Adam Optimizer [kingma2014adam] with the dropout rate of LSTM cells at 0.2.

3.3.2 Metrics on Our Dataset using TraceSeqNN

We designed comprehensive experiments to evaluate different input channels, different data augmentation methods, and different network architectures (i.e., TDNN [waibel1995phoneme], LSTM [hochreiter1997long], CLDNN [sainath2015convolutional] and LRCN [donahue2015long]) of our TraceSeqNN. The results are shown in Table 3. They demonstrate that our data augmentation method can significantly improve the overall prediction accuracy. By using motion features as the network input, false positives can be largely eliminated. To summarize, the proposed TraceSeqNN achieves the highest accuracy for HGR.

Figure 3: Confusion matrix of the 25 gesture types on the 482 test sequences of the Nvidia dataset.
Figure 4: Examples of failure cases. We show failure cases due to ambiguity of hand gestures or overlap between gestures, e.g., "Two fingers left" is part of the "Rotate fingers CW" gesture.

3.4 Evaluation on the Public Datasets for HGR

Table 4 shows the results of comparing our approach against competing state-of-the-art networks on the Nvidia dataset. As some of the ops are not supported on the hand-held low-cost device, to ensure a fair comparison, all the experiments were performed on a MacBook Pro Core i7 computer using only the CPU and 16GB of memory. Table 4 demonstrates that C3D [molchanov2015hand] and R3DCNN [gupta2016online] rely heavily on computational resources and run only at low frame rates.

For the proposed TraceSeqNN, benefiting from the designed lightweight cascaded framework, our LE-HGR significantly improves the inference speed. In addition, the proposed approach achieves the highest accuracy among the competing methods by using temporal motion features. Adopting our proposed data augmentation further increases the accuracy on this challenging gesture dataset, and our final result is obtained using only the RGB modality.

The complete confusion matrix is shown in Figure 3. If equipped with a more precise keypoint predictor (CPM-1Stage), the accuracy improves further. We show some failure cases in Figure 4.

3.5 Typical AR applications

In general, AR devices have limited user interfaces, most often small buttons or a touchscreen. These interfaces are not natural for humans and break the immersion of the AR experience.

In order to enrich the AR experience, we design an online AR interaction system based on direct hand touch and gesture sensing. As shown in Figure 5, by locating and tracking the finger position, it can understand the user's intention (e.g., "clicking", "draw star", and so on). By applying this system to educational and entertainment applications, we can provide richer and more engaging AR experiences. We present a video of these AR application demos in our supplementary material.

Figure 5: Typical AR application. The left column shows finger tracking and gesture recognition ("click") via our LE-HGR approach. The right column shows an AR education application.

4 Conclusion

We have presented a complete online RGB-based gesture recognition framework, which is extremely lightweight and efficient for low-cost commercial embedded devices. An improved multi-task cascaded network is introduced for online hand detection, and a prototype of the hand trace mapping is implemented to tackle the interference of multiple hands. In addition, a TraceSeqNN network combined with an effective data augmentation method is used to learn the temporal information and recognize the gesture categories.

The experimental results show that our proposed cascaded multi-task network significantly improves the recall and precision rates of hand detection. The designed LE-HGR framework achieves advanced accuracy with significantly reduced computational complexity. To extend this work, we are interested in directly predicting 3D hand keypoints using only RGB cameras. Moreover, a promising future research direction is to explore the attention mechanism for RGB-based HGR applications.