In the past few years, several datasets have been presented to advance the state of the art in core computer vision problems for the automatic analysis of humans, including human pose, action/gesture [17, 16, 2, 12, 42], and face [27, 15, 25, 34, 46]. Interestingly, there has been rather limited work on large-scale RGB-D gesture recognition because of the lack of annotated data for that purpose.
To this end, we present two large-scale datasets with RGB-D video sequences, namely, the ChaLearn isolated gesture dataset (IsoGD) and continuous gesture dataset (ConGD) for the tasks of isolated and continuous gesture recognition, respectively. Both of them consist of more than 47 thousand gestures falling into classes performed by performers. The IsoGD dataset has video clips (one gesture per video) whereas the ConGD dataset has clips, since one video may contain several continuous gestures. Specifically, we developed software to help complete the annotation efficiently. That is, we first used the dynamic time warping (DTW) algorithm to approximately determine the start and end points of the gestures, and then manually adjusted the start and end frames of each gesture accurately. Then, we organized two ChaLearn Large-scale Gesture Recognition Challenge workshops in conjunction with ICPR 2016 and ICCV 2017. The datasets allowed for the development and comparison of different algorithms, and the competitions and workshops provided a way to track progress and discuss the advantages and disadvantages learned from the most successful and innovative entries.
We analyze and review the published papers focusing on large-scale gesture recognition based on the IsoGD or ConGD datasets, introduce a new evaluation metric, the corrected segmentation rate (CSR), and propose a baseline method for temporal segmentation. Specifically, instead of deploying unreliable prior assumptions [41, 9, 67] or handling each video frame separately, we design a new temporal segmentation algorithm that converts the continuous gesture recognition problem to the isolated one, using a bidirectional long short-term memory (Bi-LSTM) network to determine the video division points based on the skeleton points extracted by the convolutional pose machine (CPM) [8, 55, 69] in each frame.
In addition, we also discuss and analyze the achieved results and propose potential directions for future research. We expect the challenge to push the progress in the respective fields. The main contributions of this work are summarized as follows:
We discuss the challenges of creating two large-scale gesture benchmark datasets, namely, the IsoGD and ConGD, and highlight developments in both isolated and continuous gesture recognition fields by creating the benchmark and holding the challenges. We analyze the submitted results in both challenges, and review the published algorithms in the last three years.
A new temporal segmentation algorithm named the Bi-LSTM segmentation network is proposed, which is used to determine the start and end frames of each gesture in the continuous gesture video. Compared with existing methods for temporal segmentation, the main advantage of the proposed method is to avoid the need for prior assumptions.
A new evaluation metric named corrected segmentation rate (CSR) is introduced and used to evaluate the performance of temporal segmentation. Compared with the published methods, the proposed Bi-LSTM method improves the state-of-the-art results: the CSR on the testing set of the ConGD dataset rises from 0.8917 to 0.9639, a relative improvement of about 8.1%.
The rest of this paper is organized as follows. We describe datasets, evaluation metrics and organized challenges in Section II. In Section III, we review the state-of-the-art methods focusing on both datasets, and analyze the results in detail. We propose a new algorithm for temporal segmentation in Section IV and present experimental results on the two proposed datasets in Section V. Finally, we conclude the paper and discuss some potential research directions in the field in Section VI.
|Dataset||Total gestures||Gesture labels||Avg. samp. per cls.||Train samp. (per cls.)|
|CGD , 2011||54,000||200||10||812 (1-1-1)|
|Dataset , 2013|
|ChAirGest , 2013||1,200||10||120||-|
|Dataset , 2013||-|
|Dataset , 2017|
|Sets||the IsoGD dataset||the ConGD dataset|
|#labels||#gestures||#RGB-D videos||#subjects||#labels||#gestures||#RGB-D videos||#subjects|
II Dataset Introduction and Challenge Tasks
Benchmark datasets can greatly promote the research developments in their respective fields. For example, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was held every year from 2010 to 2017 and included several challenging tasks, such as image classification, single-object localization, and object detection. The dataset presented in this challenge contains object classes with approximately million training images, thousand validation images and thousand testing images, which has greatly promoted the development of new techniques, particularly those based on deep learning architectures, for image classification and object localization. Several other datasets have also been designed to evaluate different computer vision tasks, such as human pose recovery, action and gesture recognition, and face analysis [18, 13, 14, 52].
Nevertheless, there are very few annotated datasets with a large number of samples and gesture categories for the task of RGB-D gesture recognition. Table I lists the publicly available RGB-D gesture datasets released from 2011 to 2017. Most datasets include fewer than gesture classes (e.g., [50, 40]). Although the CGD dataset has about thousand gestures, it is designed for the one-shot learning task (only one training sample per class). The multi-modal gesture dataset [18, 13] contains about thousand gestures with training samples per class, but it only has classes.
To provide the community with a large dataset for RGB-D gesture recognition, we build on the previous CGD dataset by integrating all its gesture categories and samples to design two new large RGB-D datasets for gesture spotting and classification. As Table I shows, the new IsoGD and ConGD datasets are significantly larger than state-of-the-art alternatives in terms of both the number of categories and the number of samples.
II-B Data Collection Procedure
As mentioned above, our datasets were derived from the CGD dataset, which was designed for the "one-shot learning" task. Therefore, we first introduce the CGD dataset. As shown on the left of Fig. 1, the CGD dataset contained 540 batches (or subfolders). Each batch was designed for a specific "one-shot learning" task: it had a small gesture vocabulary, 47 RGB-D video clips including 100 gestures in total, and only one training sample per class. The batches were independent of one another. In total, the CGD batches had 289 gestures from 30 lexicons and a large number of gestures (54000 gestures in about 23000 RGB-D video clips), which makes it good material from which to carve out different tasks. This is what we did by creating two large RGB-D gesture datasets: the IsoGD (http://www.cbsr.ia.ac.cn/users/jwan/database/isogd.html) and ConGD (http://www.cbsr.ia.ac.cn/users/jwan/database/congd.html) datasets. The process of data annotation from the CGD dataset is shown in Fig. 1. It mainly includes two steps: semi-manual temporal segmentation and gesture labelling.
In the temporal segmentation stage, we used semi-manual software to annotate the start and end frames of each gesture in the video sequences. Later, we checked all 540 batches against their corresponding lexicons, and manually merged similar gestures (see one sample in Fig. 2) from different lexicons. For detailed information about gesture annotation, the reader is referred to .
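The annotation tool itself is not described here. As a minimal sketch of the DTW step mentioned in the introduction, the following computes the classic DTW alignment cost between two one-dimensional feature sequences. The scalar per-frame features and the function name `dtw_distance` are assumptions made for illustration; reading approximate gesture boundaries off the alignment path is left out.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two 1-D sequences.

    Hypothetical helper: `a` and `b` stand in for per-frame features
    (one scalar per frame), which is an assumption of this sketch.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]
```

A sequence and a time-stretched copy of it align at zero cost, which is the property that makes DTW useful for matching gestures performed at different speeds.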
Finally, we obtained unique gesture labels, gestures in RGB-D videos, which are derived from batches of the CGD dataset. Some samples of dynamic gestures from both datasets are shown in Fig. 3.
II-C Dataset Statistics
The statistical information of both gesture datasets is shown in Table II. The ConGD dataset includes RGB-D gestures in RGB-D gesture videos. Each RGB-D video can represent one or more gestures, and there are gesture labels performed by different individuals. For the IsoGD dataset, we split all the videos of the CGD dataset into isolated gestures, obtaining gestures. Each RGB-D video represents one gesture instance, with gesture labels performed by different individuals.
II-D Evaluation Metrics
For both datasets, we provide training, validation, and test sets. To make the task more challenging, the three sets include data from different subjects, which means that the gestures of a subject in the validation and test sets do not appear in the training set. Following , we use the recognition rate and the mean Jaccard index (MJI) as the evaluation criteria for the IsoGD and ConGD datasets, respectively.
The MJI is a comprehensive evaluation on the final performance of continuous recognition, but it does not give a specific assessment on either the classification or the segmentation. This drawback makes it difficult to tell whether a high-performance method is attributed to its classification or segmentation strategy.
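The MJI is not defined formally in this section. The sketch below assumes the frame-level definition commonly used in ChaLearn gesture challenges: for each video, the Jaccard index is computed per ground-truth gesture label over sets of frames, averaged over the labels in that video, and then averaged over videos. The segment format `(start, end, label)` with inclusive frame indices and both function names are assumptions for illustration.

```python
def jaccard_for_video(truth, pred):
    """Per-video Jaccard index (assumed ChaLearn-style definition).

    truth, pred: lists of (start, end, label), frame indices inclusive.
    """
    def frames_by_label(segments):
        frames = {}
        for start, end, label in segments:
            frames.setdefault(label, set()).update(range(start, end + 1))
        return frames

    gt, pr = frames_by_label(truth), frames_by_label(pred)
    if not gt:
        return 0.0
    total = 0.0
    for label, gt_frames in gt.items():
        pr_frames = pr.get(label, set())
        # Overlap of frames carrying this label, relative to their union.
        total += len(gt_frames & pr_frames) / len(gt_frames | pr_frames)
    # Average over the labels present in the ground truth of this video.
    return total / len(gt)


def mean_jaccard_index(truths, preds):
    """Average the per-video Jaccard index over all videos."""
    scores = [jaccard_for_video(t, p) for t, p in zip(truths, preds)]
    return sum(scores) / len(scores)
```

Because a wrong label and a misplaced boundary both lower the same frame overlap, this score cannot separate classification errors from segmentation errors, which is the limitation discussed above.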
Therefore, to evaluate the performance of temporal segmentation, we present the corrected segmentation rate (CSR), $E_{CSR}$, based on intersection-over-union (IoU). $E_{CSR}$ is defined as:

$$E_{CSR}(p, l, r) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{m} M(p_i, l_j, r)}{\max(n, m)}$$

where $p$ is the target model's predicted segmentations for each video, each of which is given by the positions of its starting and ending frames. $l$ is the ground truth, which has the same form as $p$. $n$ and $m$ are the numbers of segmentations in the model's prediction and the ground truth, respectively. $M$ is the function that evaluates whether two sections are matched or not with a predefined threshold $r$:

$$M(a, b, r) = \begin{cases} 1, & IoU(a, b) > r \\ 0, & \text{otherwise} \end{cases}$$

where $a$ is the segmentation result that needs evaluation and $b$ is the ground truth. The IoU function is defined below, which is similar to its definition for object detection.

$$IoU(a, b) = \frac{a \cap b}{a \cup b} = \frac{\max(0, \min(a_e, b_e) - \max(a_s, b_s))}{\max(a_e, b_e) - \min(a_s, b_s)}$$

where $a_s$ and $a_e$ represent the starting frame and the ending frame of the segmentation $a$. $b_s$ and $b_e$ are defined analogously for $b$. If IoU(a, b) is greater than the threshold $r$, we consider the two segments to be matched successfully.
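The CSR computation can be sketched as follows. The text states that a predicted segment and a ground-truth segment match when their temporal IoU exceeds the threshold; the normalisation by the larger of the two segment counts is an assumption of this sketch, as are the function names `iou` and `csr` and the `(start_frame, end_frame)` tuple format.

```python
def iou(a, b):
    """Temporal IoU of two segments given as (start_frame, end_frame)."""
    a_s, a_e = a
    b_s, b_e = b
    inter = max(0, min(a_e, b_e) - max(a_s, b_s))
    union = max(a_e, b_e) - min(a_s, b_s)
    return inter / union if union > 0 else 0.0


def csr(pred, truth, r):
    """Corrected segmentation rate for one video (sketch).

    Assumption: matched pairs are divided by max(len(pred), len(truth)),
    so that both over- and under-segmentation are penalised.
    """
    denom = max(len(pred), len(truth))
    if denom == 0:
        return 0.0
    matched = sum(1 for p in pred for l in truth if iou(p, l) > r)
    return matched / denom
```

For example, with predictions `[(0, 10), (10, 20)]` and ground truth `[(0, 10), (12, 20)]`, the second pair has an IoU of 0.8, so both segments match at a threshold of 0.7 but only the first one matches at 0.9.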
 total number of teams;  teams that submitted the predicted results on the valid and test sets.
|rank by test||Team||evaluation|
|isolated gesture recognition||recognition rate|
|continuous gesture recognition||MJI|
II-E Challenge Tasks
Both the large-scale isolated and continuous gesture challenges belong to the series of ChaLearn LAP events (http://chalearnlap.cvc.uab.es/), which were launched in two rounds in conjunction with ICPR (Cancun, Mexico, December 2016) and ICCV (Venice, Italy, October 2017). Each competition consisted of a development phase (June 30, 2016 to August 7, 2016 for the first round, April 20, 2017 to June 22, 2017 for the second round) and a final evaluation phase (August 7, 2016 to August 17, 2016 for the first round, June 23, 2017 to July 2, 2017 for the second round). Table III summarizes the participation in both gesture challenges. More than 200 participants registered for the two challenges in total, and 54 teams submitted their predicted results.
For each round, training, validation and test data sets were provided. Training data were released with labels, validation data were used to provide feedback to participants on the leaderboard, and test data were used to determine the winners. Note that each track had its own evaluation metrics. The four tracks were run on the CodaLab platform (https://competitions.codalab.org/). The top three ranked participants for each track were eligible for prizes. The performances of the winners are shown in Table IV.
Participants had access to labeled development (training) and unlabeled validation data. During this phase, participants received immediate feedback on their performance on validation data through the leaderboard in CodaLab.
The unlabeled final (test) data were provided. The winners of the contest were determined by evaluating performance on these two datasets. The participants also had to send code and fact sheets describing their methods to the challenge organizers. All the code of participants was verified and replicated prior to announcing the winners. All the test labels of both datasets have since been released and can be found on the websites of IsoGD and ConGD.
The proposed challenge tasks were both "user independent" and consisted of:
Isolated gesture recognition for the IsoGD dataset.
Gesture spotting and recognition from continuous videos for the ConGD dataset.
As shown in Table II, the datasets were split into three subsets: training, validation, and test. The training set included all gestures from 17 subjects, the validation set included all gestures from 2 subjects, and the remaining gestures from 2 subjects were used in the test set. We guaranteed that the validation and test sets included gesture samples from the labels.
III Review of State-of-the-art Methods
In recent years, the commercialization of affordable RGB-D sensors, such as Kinect, has made depth maps available in addition to classical RGB; depth maps are robust against illumination variations and contain abundant 3D structure information. Based on this technology, we created the IsoGD and ConGD datasets, which have already been used by several researchers to evaluate the performance of gesture recognition models. In this section, we review and compare these methods.
As illustrated in Fig. 4, current methods fall into two categories according to whether they address isolated or continuous gesture recognition. Since gestures in the isolated gesture dataset are separated in advance, the main concern is how to assign a label to a given gesture. Owing to their promising achievements in object recognition, deep convolutional networks are also the first choice for gesture recognition. The 2D CNN is very common in practice [33, 56]. To trade off between the 2D network and spatiotemporal features, techniques like rank pooling are employed to encode the temporal information in a single image, such as a dynamic depth image [67, 66, 72, 62, 63, 58]. Some methods [36, 6, 74, 41, 43, 63, 70, 76, 38, 37] use the 3D CNN model to learn spatiotemporal features directly and concurrently. Meanwhile, the RNN or its variant LSTM is also applied to analyze sequential information from input videos [9, 62, 48, 70, 76]. However, the task of continuous gesture recognition can be more arduous. There may be 4-5 gestures in a video, so it is necessary to recognize which gesture a series of motions belongs to. This can be achieved either in a frame-by-frame fashion [6, 5] or by a temporal segmentation beforehand [9, 67, 41, 62].
Although deep learning methods work well on the gesture recognition challenge, the quality of data still plays an important role. Therefore, pre-processing is commonly employed by many methods [36, 65, 43, 41, 1]. There are two main categories of pre-processing. The first category is video enhancement. Since the videos are obtained under different conditions, the RGB videos are prone to be influenced by illumination changes. Their counterpart, the depth videos, are insensitive to illumination variations but contain some noise. Miao et al. apply the Retinex theory to normalize the illumination of RGB videos, and use a median filter to denoise depth maps. Asadi et al. utilize a hybrid median filter and an inpainting technique to enhance depth videos. The second category is based on frame unification and video calibration. The reason for frame unification is to fix the dimension of all inputs for the fully-connected layers in CNNs. After a statistical analysis of the frame number distribution of the training data on the IsoGD dataset, Li et al. fix the frame number of each clip at 32 to minimize the loss of motion path in the temporal dimension. The same criterion has been used by most subsequent methods [41, 43, 76, 38, 37, 70]. Meanwhile, although the RGB and depth videos are captured concurrently by a Kinect sensor, ensuring temporal synchronization, they are often not registered spatially. Such spatial misalignment may affect the multi-modality fusion. Therefore, Wang et al. propose a self-calibration strategy based on a pinhole model to register those data. A similar approach is taken by Asadi et al., who exploit the intrinsic and extrinsic parameters of the cameras to warp the RGB image to fit the depth one.
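Frame unification to a fixed clip length can be illustrated with uniform temporal sampling. The exact sampling rule used by the cited methods is not given here, so the index formula below is an assumption, as is the function name `unify_frames`; only the target length of 32 comes from the text.

```python
def unify_frames(frames, target_len=32):
    """Resample a clip to exactly `target_len` frames (sketch).

    Assumption: evenly spaced indices are used, so short clips repeat
    frames and long clips skip frames.
    """
    n = len(frames)
    if n == 0:
        raise ValueError("cannot resample an empty clip")
    # Index i of the output maps to floor(i * n / target_len) of the input.
    indices = [(i * n) // target_len for i in range(target_len)]
    return [frames[i] for i in indices]
```

With this rule, a 64-frame clip keeps every second frame, while a 16-frame clip repeats each frame twice, so every clip reaches the network with the same temporal dimension.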
Temporal segmentation can also be deemed a kind of pre-processing, which is only applied for continuous gesture recognition. Since there is more than one gesture in a video for continuous tasks, researchers should pay more attention to separating the gestures from each other. One possibility is dividing the videos into several clips containing only one gesture each, which can then be analyzed as an isolated gesture recognition task. Chai et al. first adopt such a segmentation strategy for continuous gesture recognition. It assumes that all gestures begin and end with the performer's hands down. The video can then be characterized as successive gesture parts and transition parts. A similar idea is used by Wang et al., Liu et al. and Wang et al. Camgoz et al. conduct such a temporal segmentation in a different way. They treat the segmentation process as a feature to learn, and use the likelihood to split the videos into multiple isolated segments, which is done by localizing the silence regions, where there is no motion.
III-B Network models
Network models are the key part of gesture recognition. Common models include 2D CNNs, 3D CNNs, RNN/LSTM, and detection models such as Faster R-CNN.
2D CNN models like AlexNet, VGG and ResNet have shown great performance in recognition tasks. Several methods [44, 66, 65] first use a 2D CNN to extract gesture features. However, in its standard form, 2D CNN convolution and pooling only operate on the spatial dimensions and do not consider temporal dynamics. To extend the 2D CNN to temporal information, Wang et al. use rank pooling to generate dynamic depth images (DDIs), and compute dynamic depth normal images (DDNIs) and dynamic depth motion normal images (DDMNIs) to wrap both the motion information and the static posture in an image. A similar idea is utilized by Wang et al. The counterpart for RGB videos, visual dynamic images (VDIs), is generated in . Moreover, Wang et al. extend the DDIs to both body-level and hand-level representations, which are called body-level dynamic depth images (BDDIs) and hand-level dynamic depth images (HDDIs), respectively. Zhang et al. use an enhanced depth motion map (eDMM) to describe depth videos and a static pose map (SPM) for postures. Two CNNs are then used to extract features from these representations. Wang et al. use the scene flow vector, which is obtained from registered RGB-D data, as a descriptor to generate an action map, which is subsequently fed into AlexNet for classification.
3D CNNs like C3D were proposed to extend 2D CNNs to compute spatiotemporal features. Li et al. [36, 38, 37] utilize a 3D CNN to extract features from RGB-D, saliency, and optical flow videos. Zhu et al. [74, 76] propose a pyramid 3D CNN model, in which the videos are divided into three 16-frame clips, with a prediction made for each of them. The final recognition is obtained by means of score fusion. Such a pyramid 3D CNN model is also employed by [62, 70, 63]. Liu et al. and Wang et al. use a 3D CNN model in a similar way for continuous gesture recognition. Camgoz et al. also use 3D CNNs for feature extraction. However, aiming at continuous gesture recognition, they assign the label of the clip centered at a given frame to exactly that frame. Based on this, they find all gestures in a video. In , the 3D CNN model is still used in a frame-wise fashion, with the final classification given by posterior estimation after several iterations.
The recurrent neural network (RNN), or its variant, long short-term memory (LSTM), is a kind of network where connections between units form a directed cycle. The special structure of RNN-like models allows for sequential input analysis. Chai et al. use two streams of RNNs to represent features of RGB-D videos, and use LSTM to model the context. Pigou et al. first use a ResNet to extract features from gray-scale video, and then use a bidirectional LSTM to process both temporal directions. Zhu et al. use convLSTM with 3D CNN input to model the sequential relations between small video clips. The 3D CNN + LSTM scheme is also employed in [70, 62, 39].
Faster R-CNN was initially proposed for detection tasks. Since gestures often start and end with the hands down, the detected location of the hands is used to indicate the interval of gestures in a continuous video. Chai et al. use Faster R-CNN to help segment gestures. This strategy is widely applied for continuous gesture recognition [41, 62]. In addition, some methods are complemented with hand region detection to further boost recognition performance [63, 58].
Attention-aware methods [45, 71] have also been applied to gesture recognition. Narayana et al. proposed a focus of attention network (FOANet), which introduces a separate channel for every focus region (global, right/left hand) and modality (RGB, depth and flow). Zhang et al. proposed an attention mechanism embedded into the convolutional LSTM (ConvLSTM) network, including attention analysis in the ConvLSTM gates for spatial global average pooling and fully-connected operations.
|Method||pre-processing||model||fusion strategy||modality of data||evaluation|
|Isolated gesture recognition evaluated on the IsoGD dataset||Recognition rate|
|Wan et al.[59, 61]’16||/||MFSK+BoVW||SVM||RGB-D||18.65%||24.19%|
|Li et al. ’16||
|Wang et al. ’16||
|Zhu et al. ’16||
|pyramidal C3D||score fusion||RGB-D||45.02%||50.93%|
|Zhu et al. ’17||
|Wang et al. ’17||calibration||AlexNet||score fusion||RGB-D(SFAM)||36.27%||/|
|Li et al. ’17||
|Li et al. ’17||
|Miao et al. ’17||
|Wang et al. ’17||
|Zhang et al. ’17||
|Zhang et al. ’17||/||AlexNet||score fusion||
|Duan et al. ’17||
|Hu et al. ’18||
|Wang et al. ’18||
|Lin et al. ’18||
|Wang et al. ’18||
|Narayana et al.  ’18||
|Zhu et al.  ’19||hand segmentation||shape representation||DTW||RGB-D||-||60.12%|
|Continuous gesture recognition evaluated on the ConGD dataset||MJI||CSR(IoU=0.7)||MJI||CSR(IoU=0.7)|
|Wan et al.[59, 61]’16||
|Wang et al. ’16||
|Chai et al. ’16||
|Camgoz et al. ’16||/||
|Pigou et al. ’17||/||
|Camgoz et al. ’17||
|Wang et al. ’17||
|Liu et al. ’17||
|Zhu et al. ’18||
III-C Multi-Modality Fusion Scheme
Since the IsoGD and ConGD datasets include two visual modalities, fusion of RGB and depth data is commonly considered to enhance recognition performance. Score fusion is a popular strategy [66, 74, 63, 62, 76, 68]. This kind of scheme consolidates the scores generated by networks that are fed with different modalities. Among these methods, averaging [74, 63, 62, 76] and multiplicative [66, 68] score fusion are the two most frequently applied. Li et al. [36, 38, 37], Zhu et al., and Miao et al. adopt feature-level fusion. The former methods [36, 74] directly blend the features of the RGB and depth modalities in a parallel or serial way, i.e., by simply averaging or concatenating them. Considering the relationship between features from different modalities that share the same label, Miao et al. adopt a fusion method based on statistical analysis, canonical correlation analysis, and Li et al. [37, 38] adopt an extended version of discriminative correlation analysis, which tries to maximize the inner-class pair-wise correlations across modalities and intra-class differences within one feature set. Hu et al. pay more attention to the fusion scheme and design a new layer comprised of a group of networks, called the adaptive hidden layer, which serves as a selector to weight features from different modalities of data. Lin et al. developed an adaptive scheme for setting the weights of each voting sub-classifier via a fusion layer, which can be learned directly by the CNNs.
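The two most common score fusion rules mentioned above, averaging and multiplication, can be sketched as follows. The per-class score lists, the function names `fuse_scores` and `predict`, and the example values are hypothetical; the reviewed methods may normalise or weight scores differently.

```python
def fuse_scores(rgb_scores, depth_scores, mode="average"):
    """Fuse per-class scores of an RGB network and a depth network (sketch)."""
    if len(rgb_scores) != len(depth_scores):
        raise ValueError("both modalities must score the same classes")
    if mode == "average":
        return [(r + d) / 2 for r, d in zip(rgb_scores, depth_scores)]
    if mode == "multiply":
        return [r * d for r, d in zip(rgb_scores, depth_scores)]
    raise ValueError("mode must be 'average' or 'multiply'")


def predict(fused_scores):
    """Return the index of the class with the highest fused score."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)


# Hypothetical softmax outputs for three gesture classes.
rgb = [0.6, 0.3, 0.1]
depth = [0.2, 0.7, 0.1]
label_avg = predict(fuse_scores(rgb, depth, "average"))
label_mul = predict(fuse_scores(rgb, depth, "multiply"))
```

Multiplicative fusion penalises a class more strongly when either modality assigns it a low score, whereas averaging lets one confident modality compensate for the other.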
III-D Other Techniques to Boost Performance
Based on the available RGB and depth data modalities in the proposed datasets, additional data modalities have been considered by researchers. Li et al. generate saliency maps to focus on image parts relevant to gesture recognition. They also use optical flow data to learn features from image motion vectors. Wang et al. and Asadi et al. note the drawback of optical flow, which can only capture 2D motion information, and instead use scene flow, which considers the motion of 3D objects by combining RGB and depth data. There are also some methods that employ skeleton data obtained with Regional Multi-person Pose Estimation (RMPE) [21, 58].
Data augmentation is another common way to boost performance. Miao et al. focus on data augmentation to increase the overall dataset size, while Zhang et al. mainly augment data to balance the number of samples among different categories. Their augmentation tricks include translation, rotation, Gaussian smoothing and contrast adjustment.
Finetuning models pre-trained on large datasets is commonly used to reduce overfitting. Most C3D-based methods are pre-trained on the Sports-1M dataset. In terms of cross-modality finetuning, Zhu et al. first train the networks with RGB and depth data from scratch, and then finetune the depth network starting from the model of its RGB counterpart. The same process is applied to the RGB network. Cross-modality finetuning showed an improvement of 6% and 8% for RGB and depth inputs, respectively.
III-E State-of-the-art comparison
Based on the previous analysis, we provide a state-of-the-art comparison on both the IsoGD and ConGD datasets. The methods considered were published in the last three years, and the results are shown in Table V. On the IsoGD dataset, 2D/3D CNNs have been widely used, and the recognition rate has improved by about 58 percentage points (from 24.19% to 82.17%). Owing to the difficulty of continuous gesture recognition, only a few papers have considered the ConGD dataset. However, the performance has also improved greatly on both the MJI and CSR metrics since 2017. Table V also shows the performance of the proposed Bi-LSTM method, which is further discussed in Section V.
In this section, we review the techniques on both isolated and continuous gesture recognition based on RGB-D data. After the release of the large-scale IsoGD and ConGD datasets, new methods have pushed the development of gesture recognition algorithms. However, there are challenges faced by available methods which allow us to outline several future research directions for the development of deep learning-based methods for gesture recognition.
Fusion of RGB-D modalities. Most methods [66, 74, 63, 62, 76] treated the RGB and depth modalities as separate channels and fused them at a later stage by concatenation or score voting, without fully exploiting the complementary properties of both visual modalities. Therefore, cooperative training using RGB-D data would be a promising and interesting research direction.
Attention-based mechanism. Some methods [45, 41, 9] used hand detectors to first detect hand regions and then designed different strategies to extract local and global features for gesture recognition. However, these attention-based methods need specialized detectors, which are hard to train, to find hand regions properly. It would be more reasonable to consider sequence modeling with self-attention [53, 54] and exploit it for dynamic gesture recognition.
Simultaneous gesture segmentation and recognition. Existing continuous gesture recognition works [41, 63, 67, 59] first detect the start and end points of each isolated gesture, and then train/test each segment separately. This procedure is not suitable for many real applications. Therefore, simultaneous gesture segmentation and recognition would be an interesting line to explore.
IV Temporal Segmentation Benchmark
In this section, we propose a benchmark method, namely Bi-LSTM network, for temporal segmentation. Before introducing the proposed method, we first illustrate drawbacks of the current temporal segmentation methods.
IV-A Drawbacks of temporal segmentation methods
1) Hand-crafted Hand Motion Extraction:
Some methods [59, 61, 67, 63] first measure the quantity of movement (QoM) for each frame in a multi-gesture sequence and then threshold the quantity of movement to obtain candidate boundaries. Then, a sliding window is commonly adopted to refine the candidate boundaries to produce the final boundaries of the segmented gesture sequences in a multi-gesture sequence. However, it captures not only the hand motion but also background movements which may be harmful to temporal segmentation.
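The QoM-based procedure criticised above can be sketched as follows. This is an illustration under stated assumptions, not the cited implementations: QoM is taken here as the summed absolute difference between a frame and a reference frame, frames are flat lists of pixel values, candidate boundaries are the centres of low-motion runs, and the sliding-window refinement is omitted. Both function names are hypothetical.

```python
def quantity_of_movement(frames, reference):
    """QoM per frame: summed absolute difference to a reference frame.

    Assumption: every pixel contributes, so background motion is counted
    together with hand motion, which is the weakness noted in the text.
    """
    return [sum(abs(p - q) for p, q in zip(frame, reference))
            for frame in frames]


def candidate_boundaries(qom, threshold):
    """Return the centre frame of every run where QoM is below `threshold`."""
    boundaries, run_start = [], None
    # A sentinel value closes a low-motion run that reaches the last frame.
    for t, q in enumerate(list(qom) + [float("inf")]):
        if q < threshold:
            if run_start is None:
                run_start = t
        elif run_start is not None:
            boundaries.append((run_start + t - 1) // 2)
            run_start = None
    return boundaries
```

For a QoM curve such as `[0, 1, 9, 8, 1, 0, 1, 7, 9, 0]` and a threshold of 2, the low-motion runs are frames 0-1, 4-6 and 9, giving candidate boundaries at frames 0, 5 and 9.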
2) Unstable Hand Detector: Some methods [9, 41] used Faster R-CNN to build the hand detector. Owing to the high degree of freedom of human hands, it is very hard to handle difficult cases, such as hand self-occlusion and drastic changes in hand shape. Errors in hand detection would considerably reduce the performance of temporal segmentation.
3) Strong Prior Knowledge Requirement: Most previous methods (e.g., [59, 61, 9, 41, 67]) rely on prior knowledge (e.g., a performer always raises the hands to start a gesture, and puts the hands down after performing a gesture). Such strong prior knowledge (i.e., the hands must be put down before another gesture is performed) is not practical for real applications.
In contrast to previous methods, we use not only the human hands but also the arm/body information [9, 41]. Moreover, we design a Bi-LSTM segmentation network to determine the start and end frames of each gesture automatically without requiring specific prior knowledge.
IV-B The Proposed Bi-LSTM Method
We treat temporal segmentation as a binary classification problem. The flowchart is shown in Fig. 5. We first use the convolutional pose machine (CPM) algorithm (https://github.com/CMU-Perceptual-Computing-Lab/openpose) [8, 55, 69] to estimate the human pose, which consists of $60$ keypoints ($18$ keypoints for the human body, and $21$ keypoints each for the left and right hands). The keypoints are shown in the left part of Fig. 5. Therefore, the human gesture/body in an image is represented by these keypoints. For the $t$-th frame of a video, the gesture is represented by a $120$-dimension ($2 \times 60$) vector $V_t$ in Eq. 4.

$$V_t = \{(x_i - \bar{x},\; y_i - \bar{y}),\; i = 1, \ldots, 60\} \quad (4)$$

where the coordinate of the $i$-th keypoint is represented by $(x_i, y_i)$, the average coordinate of all detected keypoints is denoted by $(\bar{x}, \bar{y})$, and $\bar{x} = \frac{1}{n_t}\sum_{i} x_i$, $\bar{y} = \frac{1}{n_t}\sum_{i} y_i$, where $n_t$ is the number of detected keypoints of frame $t$.
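The per-frame feature described above, keypoint coordinates centred on the mean of the detected keypoints, can be sketched as follows. Representing an undetected keypoint as `None` and filling its two entries with zeros are assumptions of this sketch, as is the function name `pose_feature`; the text does not say how missing detections are encoded.

```python
def pose_feature(keypoints):
    """Build the mean-centred pose vector for one frame (sketch).

    keypoints: list of (x, y) tuples, or None for an undetected keypoint.
    Returns a flat list of length 2 * len(keypoints).
    """
    detected = [kp for kp in keypoints if kp is not None]
    if not detected:
        return [0.0] * (2 * len(keypoints))
    # The mean is taken over detected keypoints only.
    mean_x = sum(x for x, _ in detected) / len(detected)
    mean_y = sum(y for _, y in detected) / len(detected)
    feature = []
    for kp in keypoints:
        if kp is None:
            feature.extend([0.0, 0.0])  # assumption: zeros for missing points
        else:
            feature.extend([kp[0] - mean_x, kp[1] - mean_y])
    return feature
```

Subtracting the mean makes the feature invariant to where the performer stands in the image, so the segmentation network sees pose changes rather than absolute position.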
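We use the data $\{(V_t, g_t) \mid t = 1, \ldots, T\}$ to train the Bi-LSTM network, where $T$ is the total number of frames in the video, and $g_t$ is the start and end frame indicator of a gesture, i.e., $g_t = 1$ indicates the start and last frames of a gesture, and $g_t = 0$ for other frames. The Bi-LSTM network combines bidirectional RNNs (BRNNs) and the long short-term memory (LSTM), which captures long-range information in both directions of the input. The LSTM unit is implemented by the following composite function, written here in its standard form without peephole connections:

$$\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$

where $x_t$ is the input at time $t$ (here $V_t$), $\sigma$ is the activation function, and $i_t$, $f_t$, $o_t$, $c_t$ and $h_t$ are the input gate, forget gate, output gate, cell activation vector and the hidden vector at time $t$, respectively. For the Bi-LSTM, the network computes both the forward and backward hidden vectors $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ at time $t$, and the output sequence $y_t$ as:

$$y_t = W_{\overrightarrow{h} y}\, \overrightarrow{h}_t + W_{\overleftarrow{h} y}\, \overleftarrow{h}_t + b_y$$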
|CSR*||Validation Set||Testing Set|
|Wang et al.||0.857||0.7954||0.752||0.6997||0.5908||0.7711||0.6963||0.6636||0.6265||0.5497|
|Chai et al.||-||-||-||-||-||0.709||0.5278||0.3213||0.1458||0.0499|
|Camgoz et al.||-||-||-||-||-||0.7715||0.7008||0.6603||0.6054||0.5216|
|Liu et al.||0.9313||0.9122||0.9034||0.8895||0.8132||0.9237||0.9032||0.8917||0.873||0.7750|
|Wang et al.||0.857||0.7954||0.7520||0.6997||0.598||0.7711||0.6963||0.6636||0.6265||0.5497|
|Pigou et al.||0.7247||0.6625||0.6159||0.5634||0.4772||0.7313||0.6642||0.6241||0.5722||0.4951|
|Camgoz et al.||0.8856||0.8421||0.8213||0.8024||0.7375||0.8833||0.8441||0.8254||0.8038||0.7187|
*The five columns under each set correspond to the IoU thresholds used for CSR, in increasing order from left to right.
We design four hidden layers in the Bi-LSTM network in Fig. 5. Notably, if a frame is within the segment of a gesture, we annotate it as a positive sample; otherwise, it is treated as a negative sample. To mitigate the class imbalance issue, we assign different weights to the positive and negative samples. The objective function is defined as:
where $\theta$ is the parameter matrix of the softmax function, and $\omega$ is the weight used to mitigate the class imbalance issue. Based on the ratio of positive to negative samples observed in our statistics, we set the class weights (where $\omega_1$ is the weight penalty of positive samples) to balance the loss terms of positive and negative samples in the training phase.
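The effect of the class weights can be sketched in a few lines of Python. This is a minimal illustration that assumes the objective is a class-weighted softmax cross-entropy averaged over frames; the function names and the example weights are hypothetical and are not the values used in our experiments.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one frame's class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(
    logits: Sequence[Sequence[float]],
    labels: Sequence[int],
    class_weights: Sequence[float],
) -> float:
    """Mean over frames of -w[label] * log softmax(logits)[label]."""
    total = 0.0
    for frame_logits, label in zip(logits, labels):
        prob = softmax(frame_logits)[label]
        total -= class_weights[label] * math.log(prob)
    return total / len(labels)


# Two frames with uninformative scores; the positive class (label 1)
# is given a larger, purely illustrative weight.
loss = weighted_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1], [1.0, 3.0])
```

A larger weight on the rarer class makes each of its frames contribute more to the loss, which compensates for the smaller number of such frames.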
The gradient of the objective function is computed as
where $\nabla_{\theta_k}$ is the gradient with respect to the parameter $\theta_k$, and $\mathbb{1}\{\cdot\}$ is the indicator function, i.e., $\mathbb{1}\{g_i = k\} = 1$ if and only if $g_i = k$.
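One consistent way to write such a class-weighted softmax objective and its gradient is sketched below. The symbols $J$, $z_i$ (the Bi-LSTM output for frame $i$), $\theta_k$ (the parameters of class $k$), and $\omega_k$ (the weight of class $k$) are assumptions chosen for this sketch.

```latex
\begin{aligned}
J(\theta) &= -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=0}^{1}
  \omega_k \, \mathbb{1}\{g_i = k\} \,
  \log \frac{e^{\theta_k^{\top} z_i}}{\sum_{j=0}^{1} e^{\theta_j^{\top} z_i}}, \\
\nabla_{\theta_k} J(\theta) &= -\frac{1}{m} \sum_{i=1}^{m}
  \omega_{g_i} \, z_i \left( \mathbb{1}\{g_i = k\}
  - \frac{e^{\theta_k^{\top} z_i}}{\sum_{j=0}^{1} e^{\theta_j^{\top} z_i}} \right).
\end{aligned}
```

The gradient follows from differentiating the log-softmax term: each frame contributes the difference between its one-hot label and the predicted class probability, scaled by the weight of its true class.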
In this way, we use the model learned by the Bi-LSTM network to predict, for each frame, the probability that it is a start or end frame. If the probability value of a frame is larger than 0.5, the frame is treated as a start or end frame.
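The thresholding step, and one possible way of turning the detected frames into isolated clips, can be sketched as follows. Only the 0.5 threshold comes from the text; merging each run of consecutive detected frames into a single division point at its centre, and cutting the video between consecutive division points, are assumptions of this sketch, as are the function names.

```python
from typing import List, Sequence, Tuple


def division_points(probs: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Centre frame of every run of frames whose probability exceeds threshold."""
    points: List[int] = []
    run: List[int] = []
    for index, prob in enumerate(probs):
        if prob > threshold:
            run.append(index)
        elif run:
            points.append(run[len(run) // 2])
            run = []
    if run:
        points.append(run[len(run) // 2])
    return points


def split_video(num_frames: int, points: Sequence[int]) -> List[Tuple[int, int]]:
    """Cut [0, num_frames - 1] at the division points into (start, end) clips."""
    cuts = sorted({0, num_frames - 1, *points})
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if b > a]


probs = [0.9, 0.1, 0.1, 0.8, 0.9, 0.2, 0.1, 0.7]
points = division_points(probs)
clips = split_video(len(probs), points)
```

Each resulting clip can then be passed to an isolated gesture recognition model.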
In this section, we evaluate and compare our proposed recognition-by-segmentation strategy on the ConGD dataset. First, the experimental setup is presented, including the running environment and settings. Then, the performance and comparisons on the ConGD dataset are given.
Our experiments are conducted on an NVIDIA Titan Xp GPU with PyTorch. For the Bi-LSTM segmentation network, the input is a 120-dimensional vector. We use the Adam algorithm [32] to optimize the model with the batch size . The learning rate starts from , and the models are trained for up to epochs.
The performance of the proposed Bi-LSTM method for temporal segmentation is shown in Table VI: it achieves a CSR@IoU=0.7 of 0.9668 and 0.9639 on the validation and testing sets of ConGD, respectively. After temporal segmentation, we use the model of [39] to perform the final gesture recognition. The results are also shown in Table VI, where the MJI is 0.6830 and 0.7179 on the validation and test sets, respectively. Based on MJI and CSR, our method achieves the best performance. Although MJI depends on both temporal segmentation and final classification, it still benefits from an accurate temporal segmentation, such as that of the proposed Bi-LSTM method.
We also provide an MJI comparison for each category on the ConGD dataset in Fig. 8. Here, our method (overall MJI: 0.6830 for the validation set, 0.7179 for the test set) is compared per category with two state-of-the-art methods, [41] (overall MJI: 0.5163 for the validation set, 0.6103 for the test set) and [75] (overall MJI: 0.5368 for the validation set, 0.7163 for the test set). From Fig. 8, one can see that, compared with Liu's method [41], Zhu's method [75] and ours achieve high performance in most categories. Our method obtains high performance for all categories, while the other two methods fail to recognize some categories, such as gesture classes 20 and 62 for Liu's method [41], and gesture classes 62 and 148 for Zhu's method [75].
Fig. 6 shows the CSR curve in each epoch under different IoU thresholds from to . One can see that when the IoU threshold is between and , the CSR is very stable after epochs. When the IoU is equal to , more training epochs are needed for the CSR to stabilize. This is because the criterion for correctness is stricter ( overlapped region will be treated as the correct one), so it takes more time to reach the best CSR. Our proposed Bi-LSTM method obtains very stable results under different IoU thresholds. For example, even when the IoU is equal to , the CSR of our method is still higher than . Alternative temporal segmentation methods [67, 41, 48, 5] are relatively inferior (the best is about in ) on the validation set. Our Bi-LSTM also achieves the best performance on the test set of the ConGD dataset.
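To make the role of the IoU threshold concrete, the following Python sketch computes a CSR-style score for one video. It assumes that a predicted segment counts as correct when its temporal IoU with some ground-truth segment reaches the threshold, and that the count is normalised by the larger of the two segment counts; the formal definition of CSR is given with the evaluation metrics, and the names used here are hypothetical.

```python
from typing import Sequence, Tuple

Segment = Tuple[int, int]  # inclusive (start_frame, end_frame)


def interval_iou(a: Segment, b: Segment) -> float:
    """Temporal intersection over union of two inclusive frame ranges."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def csr(pred: Sequence[Segment], truth: Sequence[Segment], threshold: float) -> float:
    """Fraction of correctly segmented gestures at the given IoU threshold."""
    if not pred and not truth:
        return 1.0
    correct = sum(
        1 for p in pred if any(interval_iou(p, t) >= threshold for t in truth)
    )
    return correct / max(len(pred), len(truth))


truth = [(0, 9), (10, 19)]
pred = [(0, 9), (12, 19)]  # second segment starts two frames late
score_loose = csr(pred, truth, 0.7)
score_strict = csr(pred, truth, 0.9)
```

In this toy case the late segment has an IoU of 0.8, so it is accepted at a threshold of 0.7 but rejected at 0.9, which is why the score can only stay equal or drop as the threshold increases.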
Then, we randomly select 1000 video sequences in the ConGD dataset to check the computational requirements. The proposed Bi-LSTM method (excluding the CPM algorithm) required about 0.4 seconds in a GPU environment (NVIDIA TITAN X (Pascal)) and 6 seconds in a CPU environment (Intel(R) Core(TM) i7-5820K@3.30GHz) to process all of them. This demonstrates that the proposed Bi-LSTM method is very fast (0.4 ms/video on GPU, 6 ms/video on CPU).
Finally, we selected the three longest video sequences of the ConGD dataset; the segmentation results of the proposed Bi-LSTM method are shown in Fig. 7. The green points are the ground-truth segmentation points, while the blue line is the confidence of positive responses. These three videos have more than 100 frames and contain at least 5 gestures. Compared with videos containing fewer gestures, the dense gestures make it hard to find the segmentation points accurately. Nevertheless, our Bi-LSTM method can mark the start and end points of each gesture, and all gestures are segmented with a confidence over 0.8.
In this paper, we proposed the IsoGD and ConGD datasets for the tasks of isolated and continuous gesture recognition, respectively. Both are currently the largest datasets for dynamic gesture recognition. Based on both datasets, we ran challenges at the ICPR 2016 and ICCV 2017 workshops, which attracted more than 200 teams from around the world and pushed the state of the art in large-scale gesture recognition. We then reviewed the state-of-the-art methods of the last three years for gesture recognition based on the provided datasets. In addition, we proposed the Bi-LSTM method for temporal segmentation. We expect the proposed datasets to advance research in gesture recognition.
Although some published papers have achieved promising results on the proposed datasets, there are several avenues that can be explored to further improve gesture recognition. First, a structure attention mechanism can be further explored. In our method, each attention part (i.e., arm, gesture) is trained separately. We believe gesture recognition will benefit from joint structure learning (i.e., body, hand, arm and face). Second, new end-to-end methods are expected to be designed. We discussed different works that benefited from the fusion and combination of different trained models and modalities. End-to-end learning is expected to further boost performance; this will also require detailed analyses of the computation/accuracy trade-off. Moreover, online gesture recognition is another challenging problem. The search for an efficient gesture detection/spotting/segmentation strategy remains an open issue. We expect the ConGD dataset to support the evaluation of models in this direction.
[1] (2017) Action recognition from rgb-d data: comparison and fusion of spatio-temporal handcrafted features and deep strategies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3179–3188.
[2] (2015) ChaLearn looking at people 2015 challenges: action spotting and cultural event recognition. In Workshops in Conjunction with IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
[3] (1994) Learning long-term dependencies with gradient descent is difficult. TNNLS 5 (2), pp. 157–166.
[4] (2014) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
[5] (2017) Particle filter based probabilistic forced alignment for continuous gesture recognition. In Workshops in Conjunction with IEEE International Conference on Computer Vision.
[6] (2016) Using convolutional 3d neural networks for user-independent continuous gesture recognition. In Proceedings of International Conference on Pattern Recognition, pp. 49–54.
[7] Egocentric gesture recognition using recurrent 3d convolutional neural networks with spatiotemporal transformer modules. In The IEEE International Conference on Computer Vision (ICCV).
[8] (2017) Realtime multi-person 2d pose estimation using part affinity fields. In CVPR.
[9] (2016) Two streams recurrent neural networks for large-scale continuous gesture recognition. In Proceedings of International Conference on Pattern Recognition, pp. 31–36.
[10] (2017) A unified framework for multi-modal isolated gesture recognition. In TOMM.
[11] (2016) Chalearn joint contest on multimedia challenges beyond visual analysis: an overview. In Proceedings of International Conference on Pattern Recognition, pp. 67–73.
[12] Challenges in multimodal gesture recognition. Journal of Machine Learning Research 17, pp. 72:1–72:54.
[13] (2014) ChaLearn looking at people challenge 2014: dataset and results. ChaLearn LAP Workshop, ECCV.
[14] (2015) ChaLearn looking at people 2015: apparent age and cultural event recognition datasets and results. In International Conference in Computer Vision, Looking at People, ICCVW.
[15] (2015) ChaLearn looking at people 2015 new competitions: age estimation and cultural event recognition. In International Joint Conference on Neural Networks, pp. 1–8.
[16] (2013) ChaLearn multi-modal gesture recognition 2013: grand challenge and workshop summary. In International Conference on Multimodal Interaction, pp. 365–368.
[17] (2013) Multi-modal gesture recognition challenge 2013: dataset and results. In International Conference on Multimodal Interaction, pp. 445–452.
[18] (2013) Multi-modal gesture recognition challenge 2013: dataset and results. In Proceedings of the 15th ACM on International conference on multimodal interaction, pp. 445–452.
[19] (2016) Guest editors’ introduction to the special issue on multimodal human pose recovery and behavior analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (8), pp. 1489–1491.
[20] (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
[21] (2017) RMPE: regional multi-person pose estimation. Proceedings of the IEEE International Conference on Computer Vision.
[22] (2017) Rank pooling for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (4), pp. 773–787.
[23] (2015) Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
[24] (2005) Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks 18 (5-6), pp. 602–610.
[25] (2018) Dominant and complementary emotion recognition from still images of faces. IEEE Access 6, pp. 26391–26403.
[26] (2013) Results and analysis of the chalearn gesture challenge 2012. In Advances in Depth Image Analysis and Applications, pp. 186–204.
[27] (2018) Changes in facial expression as biometric: a database and benchmarks of identification. In Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on, pp. 621–628.
[28] (2016) Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[29] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[30] Learning adaptive hidden layers for mobile gesture recognition. AAAI Conference on Artificial Intelligence.
[31] (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1725–1732.
[32] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[33] (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[34] (2017) Automatic recognition of facial displays of unfelt emotions. IEEE Transactions on Affective Computing.
[35] (1971) Lightness and retinex theory. JOSA 61 (1), pp. 1–11.
[36] (2016) Large-scale gesture recognition with a fusion of rgb-d data based on the c3d model. In Proceedings of International Conference on Pattern Recognition, pp. 25–30.
[37] (2017) Large-scale gesture recognition with a fusion of rgb-d data based on saliency theory and c3d model. IEEE Transactions on Circuits and Systems for Video Technology.
[38] (2017) Large-scale gesture recognition with a fusion of rgb-d data based on optical flow and the c3d model. Pattern Recognition Letters.
[39] (2018) Large-scale isolated gesture recognition using a refined fused model based on masked res-c3d network and skeleton lstm. In Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition.
[40] (2013) Learning discriminative representations from rgb-d video data. In IJCAI.
[41] (2017) Continuous gesture recognition with hand-oriented spatiotemporal feature. In Workshops in Conjunction with IEEE International Conference on Computer Vision, pp. 3056–3064.
[42] (2016) Sase: rgb-depth database for human head pose estimation. In European Conference on Computer Vision, pp. 325–336.
[43] (2017) Multimodal gesture recognition based on the resc3d network. In Workshops in Conjunction with IEEE International Conference on Computer Vision, pp. 3047–3055.
[44] (2011) Max-pooling convolutional neural networks for vision-based hand gesture recognition. In IEEE International Conference on Signal and Image Processing Applications, pp. 342–347.
[45] (2018) Gesture recognition: focus on the hands. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5235–5244.
[46] (2017) Audio-visual emotion recognition in video clips. IEEE Transactions on Affective Computing.
[47] (2017) Automatic differentiation in pytorch.
[48] (2017) Gesture and sign language recognition with temporal residual networks. In Workshops in Conjunction with IEEE International Conference on Computer Vision, pp. 3086–3093.
[49] (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
[50] (2013) ChAirGest: a challenge for multimodal mid-air gesture recognition for close hci. In ACM on International conference on multimodal interaction, pp. 483–488.
[51] (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681.
[52] (2016) Challenges in multimodal gesture recognition. Journal of Machine Learning Research.
[53] (2018) Reinforced self-attention network: a hybrid of hard and soft attention for sequence modeling. arXiv preprint arXiv:1801.10296.
[54] (2018) Bi-directional block self-attention for fast and memory-efficient sequence modeling. arXiv preprint arXiv:1804.00857.
[55] (2017) Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.
[56] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[57] (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497.
[58] (2017) Results and analysis of chalearn lap multi-modal isolated and continuous gesture recognition, and real versus fake expressed emotions challenges. In ChaLearn LaP, Action, Gesture, and Emotion Recognition Workshop and Competitions: Large Scale Multimodal Gesture Recognition and Real versus Fake expressed emotions, ICCV, Vol. 4.
[59] (2015) Explore efficient local features from rgb-d data for one-shot learning gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[60] (2013) One-shot learning gesture recognition from RGB-D data using bag of features. Journal of Machine Learning Research 14 (1), pp. 2549–2582.
[61] (2016) Chalearn looking at people rgb-d isolated and continuous datasets for gesture recognition. In Workshops in Conjunction with IEEE Conference on Computer Vision and Pattern Recognition, pp. 56–64.
[62] (2017) Large-scale multimodal gesture recognition using heterogeneous networks. In Workshops in Conjunction with IEEE International Conference on Computer Vision, pp. 3129–3137.
[63] (2017) Large-scale multimodal gesture segmentation and recognition based on convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3138–3146.
[64] (2018) Depth pooling based large-scale 3-d action recognition with convolutional neural networks. IEEE Transactions on Multimedia 20 (5), pp. 1051–1061.
[65] (2017) Scene flow to action map: a new representation for rgb-d based action recognition with convolutional neural networks. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.
[66] (2016) Large-scale isolated gesture recognition using convolutional neural networks. In Proceedings of International Conference on Pattern Recognition, pp. 7–12.
[67] (2016) Large-scale continuous gesture recognition using convolutional neural networks. In Proceedings of International Conference on Pattern Recognition, pp. 13–18.
[68] (2018) Cooperative training of deep aggregation networks for rgb-d action recognition. In AAAI Conference on Artificial Intelligence.
[69] (2016) Convolutional pose machines. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.
[70] (2017) Learning spatiotemporal features using 3dcnn and convolutional lstm for gesture recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3120–3128.
[71] (2018) EgoGesture: a new dataset and benchmark for egocentric hand gesture recognition. IEEE Transactions on Multimedia 20 (5), pp. 1038–1050.
[72] (2017) Gesture recognition using enhanced depth motion map and static pose map. In Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition, pp. 238–244.
[73] (2019) Vision based hand gesture recognition using 3d shape context. IEEE/CAA Journal of Automatica Sinica, pp. 1–14.
[74] (2016) Large-scale isolated gesture recognition using pyramidal 3d convolutional networks. In Proceedings of International Conference on Pattern Recognition, pp. 19–24.
[75] (2018) Continuous gesture segmentation and recognition using 3dcnn and convolutional lstm. IEEE Transactions on Multimedia.
[76] (2017) Multimodal gesture recognition using 3d convolution and convolutional lstm. IEEE Access.