A Deep Learning Framework for Recognizing both Static and Dynamic Gestures

06/11/2020 ∙ by Osama Mazhar, et al.

Intuitive user interfaces are indispensable for interacting with human-centric smart environments. In this paper, we propose a unified framework that recognizes both static and dynamic gestures, using simple RGB vision (without depth sensing). This feature makes it suitable for inexpensive human-machine interaction (HMI). We rely on a spatial attention-based strategy, which employs SaDNet, our proposed Static and Dynamic gestures Network. From the image of the human upper body, we estimate his/her depth, along with the region-of-interest around his/her hands. The Convolutional Neural Networks in SaDNet are fine-tuned on a background-substituted hand gestures dataset. They are utilized to detect 10 static gestures for each hand and to obtain hand image-embeddings from the last Fully Connected layer, which are subsequently fused with the augmented pose vector and then passed to stacked Long Short-Term Memory blocks. Thus, human-centered frame-wise information from the augmented pose vector and the left/right hand image-embeddings is aggregated in time to predict the dynamic gestures of the performing person. In a number of experiments, we show that the proposed approach surpasses the state-of-the-art results on the large-scale Chalearn 2016 dataset. Moreover, we also transfer the knowledge learned through the proposed methodology to the Praxis gestures dataset, and the obtained results also outscore the state-of-the-art on this dataset.




1 Introduction

Modern manufacturing industry requires human-centered smart frameworks, which should focus on human abilities rather than demand that humans adjust to the technology. In this context, gesture-driven user interfaces tend to exploit humans’ prior knowledge and are vital for intuitive interaction of humans with smart devices [20]. Gesture recognition is a problem that has been widely studied for developing human-computer/machine interfaces with an input device alternative to the traditional ones (e.g., mouse, keyboard, teach pendants and touch interfaces). Its applications include robot control [17], health monitoring systems [14], interactive games [31] and sign language recognition [29].

The aim of our work is to develop a robust, vision-based gesture recognition strategy suitable for human-machine/computer interaction tasks. We intend to realize a unified framework to recognize static gestures from a single image, as well as dynamic gestures from video sequences. Two datasets are exploited in our work, i.e., OpenSign [22] and the Chalearn 2016 isolated gestures dataset [38], referred to simply as Chalearn 2016 in the rest of the paper. OpenSign contains color and registered raw depth images from Kinect V2, recorded for American Sign Language (ASL) static gestures performed by volunteers. Chalearn 2016 is a large-scale dataset which contains Kinect V1 color and depth recordings of dynamic gestures performed by volunteers. The gesture vocabulary in Chalearn 2016 is mainly from nine groups corresponding to different application domains: body language gestures, gesticulations, illustrators, emblems, sign language, semaphores, pantomimes, activities and dance postures. Each video in the dataset (color and depth) represents one gesture.

Initially, we design a robust static hand gestures detector, which exploits OpenSign to fine-tune Inception V3 on static hand gestures, as presented in our previous work [21]. This enables the fine-tuned Inception V3 to extract hand-specific features frame by frame, invariant to background and illumination changes. Subsequently, we integrate our fine-tuned Inception V3 with Long Short-Term Memory (LSTM) blocks for dynamic gesture recognition. This is inspired by the long-term recurrent convolutional networks (LRCN) proposed by Donahue et al. in [7]. We name our unified network SaDNet - Static and Dynamic gestures Network. The idea of visual attention presented in [32], which is based on human selective focus and perception, is also integrated in SaDNet. Thus, we develop a spatial-attention mechanism, which focuses on the human upper body and on his/her hands (see Fig. 1). It is also noteworthy that in RGB images, scale information about the subjects (e.g., size of his/her body parts) is lost. Thus, exploiting as sole input the image of the person’s upper body, we devise learning-based depth estimators to regress the distance of the hands from the sensor. Then, we can scale the 2D human skeleton and determine the region-of-interest/bounding boxes around his/her hands.

In this paper, we thoroughly present our network SaDNet for dynamic gestures recognition. Our spatial-attention module allows SaDNet to learn large-scale upper-body motions plus subtle hand movements, and therefore distinguish several inter-class ambiguities. The proposed network is trained exclusively on RGB videos of Chalearn 2016 to detect 249 dynamic gestures, while outperforming the state-of-the-art recognition scores on that same dataset.

Fig. 1: Illustration of our framework. We employ openpose to extract the 2D upper-body pose plus hand key-points of the person from monocular RGB image frames. Then, the hand key-points, the original RGB input images and the outputs of our two hand depth estimators are passed to the Focus on Hands Module, which crops images of the person’s hands. Meanwhile, the Pose Pre-Processing Module filters the skeleton, by interpolating missing joint coordinates and smoothing the pose. It also performs scale normalization, by exploiting the skeleton depth estimator, and position normalization, by subtracting the root (neck) coordinates from all others. The normalized skeleton appears approximately in the center of the output image, for visualization purposes only. This normalized skeleton is then passed to a Pose Augmentation and Dynamic Features Extraction block. Cropped left/right hand images and the augmented pose vector are finally fed to SaDNet, which outputs static hand gesture labels for each input image, and dynamic gesture labels for a sequence of images/a video.

2 Related Work

Traditional activity recognition approaches aggregate local spatio-temporal information via hand-crafted features. These visual representations include the Harris3D detector [19], the Cuboid detector [6], dense sampling of video blocks [42], dense trajectories [39] and improved trajectories [40]. Visual representations obtained through optical flow, e.g., Histograms of Oriented Gradients (HOG), Histograms of Optical Flow (HOF) and Motion Boundary Histograms (MBH) have also given excellent results for video classification on a variety of datasets [42, 41]. In these approaches, global descriptors of the videos are obtained by encoding the hand-crafted features using Bag of Words (BoW) and Fisher vector encodings [33], which assign descriptors to one or several nearest elements in a vocabulary [15], while classification is typically performed through Support Vector Machines (SVMs).

Lately, the tremendous success of deep neural networks on image classification tasks [9, 35] instigated exploitation of the same in the domain of activity recognition. The literature on gestures/activity recognition exploiting deep neural networks is already enormous. Here, we focus on related notables which have inspired our proposed strategy.

2.1 3D Convolutional Neural Networks

Among the pioneering works in this category, [13] adapts Convolutional Neural Networks (CNNs) to 3D volumes (3D-CNNs) obtained by stacking video frames, to learn spatio-temporal features for action recognition. In [1], Baccouche et al. proposed an approach for learning the evolution of temporal information through LSTM recurrent neural networks [10] from features extracted through 3D-CNNs applied to short video clips of approximately 9 successive frames. However, Karpathy et al. in [16] found that the stacked-frames architecture performed similarly to the single-image one.

To handle the resulting high-dimensional video representations, the authors of [48] proposed the use of random projection-based ensemble learning in deep networks for video classification. They also proposed a rectified linear encoding (RLE) method to deal with redundancy in the initial results of classifiers. The output from RLE is then fused by a fully-connected layer that produces the final classification results.

2.2 Multi-modal Multi-scale strategies

The authors of [29] presented a multi-modal multi-scale detection strategy for dynamic poses of varying temporal scales as an extension to their previous work [28]. The employed modalities include color and depth videos, plus articulated pose information obtained from the depth map. The authors proposed a complex learning method which includes pre-training of individual classifiers on separate channels and iterative fusion of all modalities on shared hidden and output layers. This approach involved recognizing 20 categories from Italian conversational gestures, performed by different people and recorded with an RGB-D sensor. The proposed strategy is similar in function to [16], except that it includes depth images and pose as additional modalities. However, it lacks a dedicated mechanism to learn the evolution of temporal information and may fail when understanding long-term dependencies of the gestures is required.

In [23], authors proposed a multi-modal large-scale gesture recognition scheme on Chalearn 2016 Looking at People isolated gestures recognition dataset [38]. In [36], ResC3D network is exploited for feature extraction, and late fusion combines features from multi-modal inputs in terms of canonical correlation analysis. The authors used linear SVM to classify final gestures. They propose a key frame attention mechanism, which relies on movement intensity in the form of optical flow, as an indicator for frame selection.

2.3 Multi-stream Optical Flow-based Methods

The authors of [34] proposed an optical flow-based method exploiting convolutional networks for activity recognition along the same lines as [16]. They presented the idea of decoupling spatial and temporal networks. The proposed architecture in [34] is related to the two-stream hypothesis of the human visual cortex [8]. The spatial stream in this work operates on individual video frames, while the input to the temporal stream is formed by stacking optical flow displacement fields between multiple consecutive frames.

The authors of [44] presented improved results in action recognition, by employing a trajectory-pooled two-stream CNN inspired by [34]. They exploited the concept of improved trajectories as a low-level trajectory extractor. This allows characterization of the background motion in two consecutive frames through the estimation of the homography matrix, taking into account camera motion. Optical flow-based methods (e.g., the key frame attention mechanism proposed in [23]) may help emphasize frames with motion, but are unable to differentiate motion caused by irrelevant objects in the background.

2.4 CNN-LSTM and Convolutional-LSTM Networks

The work in [47] proposed aggregation of frame-level CNN activations through 1) Feature-pooling method and 2) LSTM network for longer sequences. The authors argued that predictions on individual frames of video sequences or on shorter clips as in [16] may only contain local information of the video description and may confuse classes if there are fine-grained distinctions.

The authors in [7] proposed a Long-term Recurrent Convolutional Network (LRCN) for multiple situations including sequential input and static output for cases like activity recognition. The visual features from RGB images are extracted through a deep CNN, which are then fed into stacked LSTM in distinctive configurations corresponding to the task at hand. The parameters are learned in an “end-to-end” fashion, such that the visual features which are relevant to the sequential classification problem are extracted.

The authors in [45] proposed a method to process sequential images through Convolutional-LSTM (ConvLSTM), which is a variant of LSTM containing a convolution operation inside the LSTM cell. In [50], the authors studied redundancy and attention in ConvLSTM by deriving its several variants for gesture recognition. They proposed Gated-ConvLSTM by removing spatial convolutional structures in the gates as they scarcely contributed to the spatio-temporal feature fusion in their study. The authors evaluated results on Chalearn 2016 dataset and found that the Gated-ConvLSTM achieved reduction in parameters size and in computational cost. However, it did not improve detection accuracy to a considerable amount.

2.5 Multi-Label Video Classification

The authors of [46] presented a multi-label action recognition scheme. It is based on a Multi-LSTM network which handles multiple inputs and outputs. The authors fine-tuned the VGG-16 CNN, already trained on ImageNet [18], on the Multi-THUMOS dataset, which is an extension of the THUMOS dataset [12], at the individual frame level. A fixed-length window of 4096-dimensional “fc7” features of the fine-tuned VGG-16 is passed as input to the LSTM through an attention mechanism that weights the contribution of individual frames in the window.

2.6 Attention-based Strategies

The application of convolutional operations on entire input images tends to be computationally complex and expensive. In [32], Rensink discussed the idea of visual representation, which implies that humans do not form detailed depiction of all objects in a scene. Instead, their perception focuses selectively on the objects needed immediately. This is supported by the concept of visual attention applied for deep learning methods as in [24].

Baradel et al. [3] proposed a spatio-temporal attention mechanism conditioned on human pose. The proposed spatial-attention mechanism was inspired by the work of Mnih et al. [24] on glimpse sensors. A spatial attention distribution is learned conjointly through the hidden state of the LSTM network and through the learned pose feature representations. Later, Baradel et al. extend their work in [2] and proposed that the spatial attention distribution can be learned only through an augmented pose vector, which is defined by the concatenation of current pose, velocity and accelerations of each joint over time.

The authors in [49] proposed a three streams attention network for activity detection. These are statistic-based, learning-based and global-pooling attention streams. Shared ResNet is used to extract spatial features from image sequences. They also propose a global attention regularization scheme to enable exploited recurrent networks to learn dynamics based on global information.

Lately, the authors of [25] presented the state-of-the-art results on Chalearn 2016 dataset. They proposed a novel multi-channel architecture, namely FOANet, built upon a spatial focus of attention (FOA) concept. They crop the regions of interest occupied by hands in the RGB and depth images, through the region proposal network and Faster R-CNN method. The architecture comprises 12 channels in total with: 1 global (full-sized image) channel and 2 focused (left and right hand crops) channels for each of the 4 modalities (RGB, depth and optical flow fields extracted from RGB and depth images). The softmax scores of each modality are fused through a sparse fusion network.

Fig. 2: The Skeleton Filter described in Sect. 3.1.1. Images are arranged from left to right in chronological order. The central image shows the skeleton output by the filter. The six other images show the raw skeletons output by openpose. Observe that, thanks to equation (1), our filter has added the right wrist coordinates (shown only in the central image). These are copied from a frame in which the wrist was detected, as it was missing from the raw skeletons of other frames of the window. We then apply Gaussian smoothing, which removes jitter in the skeleton output by openpose.

2.7 Our Strategy

In this work, we develop a novel unified strategy to model human-centered spatio-temporal dependencies for the recognition of static as well as dynamic gestures. We employ a CNN to extract spatial features from the input frames and an LSTM to learn their evolution over time. Our Spatial Attention Module localizes and crops hand images of the person, which are subsequently passed as inputs to our SaDNet, unlike previous methods that take entire images as input, e.g., [7, 47]. Contrary to [46], where a pre-trained state-of-the-art network is fine-tuned on entire image frames of gesture datasets, we fine-tune Inception V3 on a background-substituted hand gestures dataset and use it as our CNN block. Thus, our CNN has learned to concentrate on image pixels occupied exclusively by hands. This enables it to accurately distinguish subtle hand movements. We have fine-tuned Inception V3 with a softmax layer, to classify 10 ASL static hand gestures, while the output of the last fully connected (FC) layer of the network is an image-embedding vector of 1024 elements, used as input for the dynamic gestures detector. Contrary to previous strategies for dynamic gesture recognition/video analysis [29, 3, 2], which employed 3D human skeletons to learn large-scale body motion, along with the corresponding sensor modalities, we only utilize the 2D upper-body skeleton as an additional modality to our algorithm. However, scale information about the subjects is lost in monocular images. Thus, we also propose learning-based depth estimators, which determine the approximate depth of the person from the camera and the region-of-interest around his/her hands from upper-body 2D skeleton coordinates only. To reiterate, the inputs to our SaDNet are limited to color hand images and an augmented pose vector obtained from 8 upper-body 2D skeleton coordinates, unlike other existing approaches such as [25], which include full-frame images in addition to hand images, depth frames and even optical flow frames. Thus, our proposed strategy is generic and straightforward to implement on mobile systems, commonly present in the Industrial Internet of Things (IIoT).

3 Spatial Attention Module

Our spatial attention module is divided into two parts: Pose Pre-processing Module and Focus on Hands Module (see Fig. 1). We detail these modules in the following.

3.1 Pose Pre-processing Module

We employ openpose [4], which is an efficient discriminative 2D pose extractor, to extract the human skeleton and the hand keypoints in images. Any other skeleton extractor can be employed in place of openpose. We first resize the dataset videos so that each frame has 1080 rows, with the number of columns chosen to maintain the aspect ratio of the original image (1440 columns in our work). Resizing the images is necessary, since the neural network which performs scale normalization (which we will explain in Sect. 3.1.2) is trained on augmented poses and ground-truth depths obtained from OpenSign Kinect V2 images with the same number of rows. After having resized the videos, we feed them to openpose, one at a time, and the output skeleton joint and hand keypoint coordinates are saved for offline pre-processing.

The pose pre-processing is composed of three parts, detailed hereby: skeleton filter, skeleton position and scale normalization and skeleton depth estimation.

3.1.1 Skeleton Filter

On each image, openpose extracts $N$ skeleton joint coordinates ($N$ depends on the selected body model) and does not employ pose tracking between images. The occasional jitter in the skeleton output and the absence of joint coordinates within successive frames may hinder gesture learning. Thus, we develop a two-step pose filter that rectifies the occasional disappearance of joint coordinates and smooths the openpose output. The filter operates on a window of $K$ consecutive images ($K$ is an adjustable odd number), to replace the central image. Figure 2 shows an example with a window of seven images. We note $\mathbf{p}_{i,k}$ the image coordinates of the $i$-th joint in the skeleton output by openpose at the $k$-th image within the window, with $k = K$ the latest (current) image. If openpose does not detect joint $i$ on image $k$: $\mathbf{p}_{i,k} = \emptyset$.

In a first step, we replace the coordinates of the missing joints. Only $\rho$ consecutive replacements are allowed for each joint $i$, and we monitor this via a coordinate replacement counter, noted $r_i$. The procedure is driven by the following two equations:

$$\mathbf{p}_{i,K} = \mathbf{p}_{i,K-1} \quad \text{if} \quad \mathbf{p}_{i,K} = \emptyset \;\wedge\; \mathbf{p}_{i,k} \neq \emptyset \;\; \forall k = 1, \dots, K-1 \;\wedge\; r_i \leq \rho \qquad (1)$$

$$\mathbf{p}_{i,k} = \emptyset \quad \forall k = 1, \dots, K \quad \text{if} \quad \mathbf{p}_{i,K} = \emptyset \;\wedge\; r_i > \rho \qquad (2)$$

Equation (1) states that the $i$-th joint at the latest (current) image is replaced by the same joint at the previous image under three conditions: if it is not detected, if it has been detected in all previous images of the window, and if it has not already been replaced more than $\rho$ consecutive times. If any of the conditions is false, we do not replace the coordinates and we reset the replacement counter for the considered joint: $r_i = 0$. Similarly, (2) states that the $i$-th joint coordinates over the window should not be taken into account, i.e., joint $i$ will be considered missing, if it is not detected in the current image and if it has already been replaced more than $\rho$ consecutive times (we allow only $\rho$ consecutive replacements driven by (1)). This also resets the replacement counter value for the considered joint. Moreover, the $i$-th joint in all of the window’s images is set to its position in the current image if it has never been detected in the window up to the current image.

In a second step, we apply Gaussian smoothing to each $\mathbf{p}_{i,k}$, over the window of $K$ images. Applying this filter removes jitter from the skeleton pose and smooths out the joint movements in the image at the center of the filter window.
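The two filter steps can be illustrated with a minimal sketch for a single joint. It is only one reading of the prose above, not the authors' implementation: the function names, the `None` encoding of an undetected joint, the way the counter is incremented, and the `sigma` value are all assumptions.

```python
import math

def fill_missing(window, counter, max_repl):
    """Replacement step for one joint over a window of frames (hypothetical helper).

    window   : list of (x, y) tuples or None (None = joint not detected),
               ordered oldest to newest; the last entry is the current frame.
    counter  : consecutive replacements already made for this joint.
    max_repl : allowed number of consecutive replacements.
    Returns the updated (window, counter).
    """
    window = list(window)
    current, past = window[-1], window[:-1]
    if current is None:
        if all(p is not None for p in past) and counter <= max_repl:
            # Rule (1): reuse the previous frame's coordinates.
            window[-1] = window[-2]
            return window, counter + 1
        if counter > max_repl:
            # Rule (2): consider the joint missing over the whole window.
            return [None] * len(window), 0
        # A condition of rule (1) failed: no replacement, counter reset.
        return window, 0
    if all(p is None for p in past):
        # Never detected before: back-fill the window with the current position.
        window = [current] * len(window)
    return window, 0

def gaussian_smooth_center(window, sigma=1.0):
    """Gaussian-weighted average of the joint, centered on the middle frame.

    sigma is an assumed value; missing entries are skipped.
    """
    center = len(window) // 2
    wsum = xs = ys = 0.0
    for k, p in enumerate(window):
        if p is None:
            continue
        w = math.exp(-0.5 * ((k - center) / sigma) ** 2)
        wsum += w
        xs += w * p[0]
        ys += w * p[1]
    return None if wsum == 0.0 else (xs / wsum, ys / wsum)
```

In use, `fill_missing` would be called for every joint each time a new frame enters the window, and `gaussian_smooth_center` then yields the filtered coordinates of the central frame.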

3.1.2 Skeleton Position and Scale Normalization

Fig. 1 includes a simple illustration of our goal for skeleton position and scale normalization. We focus on the 8 upper-body joints shown in Fig. 3, noted $\mathbf{p}_i$, with $\mathbf{p}_n$ corresponding to the Neck joint, which we consider as the root node. Position normalization consists in eliminating the influence of the user’s position in the image, by subtracting the Neck joint coordinates from those of the other joints. Scale normalization consists in eliminating the influence of the user’s depth. We do this by dividing the position-shifted joint coordinates by the neck depth $d_n$, on each image, so that all joints are replaced according to:

$$\mathbf{p}_i \leftarrow \frac{\mathbf{p}_i - \mathbf{p}_n}{d_n} \qquad (3)$$

Since our framework must work without requiring a depth sensor, we have developed a skeleton depth estimator to derive the neck depth, and use its estimate instead of $d_n$ in (3). This estimator is a neural network, which maps a 97-dimensional pose vector, derived from the 8 upper-body joint positions, to the depth of the Neck joint. We explain it below.
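As a small illustration of position and scale normalization, the sketch below shifts all joints by the Neck coordinates and divides by the neck depth. The function name, the list-of-tuples layout and the example numbers are assumptions made for the example only.

```python
def normalize_skeleton(joints, neck_index, neck_depth):
    """Shift joints so the Neck is the origin, then divide by the neck depth.

    joints     : list of (x, y) image coordinates (assumed layout).
    neck_index : position of the Neck joint in that list.
    neck_depth : measured or estimated depth of the Neck joint (must be > 0).
    """
    if neck_depth <= 0:
        raise ValueError("neck_depth must be positive")
    nx, ny = joints[neck_index]
    return [((x - nx) / neck_depth, (y - ny) / neck_depth) for x, y in joints]


# Example: a neck at (100, 200) and one other joint, with the person 2.0 units away.
normalized = normalize_skeleton([(100.0, 200.0), (150.0, 260.0)], 0, 2.0)
```

After this step the Neck always sits at the origin, so the resulting coordinates no longer depend on where the person stands in the frame.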

Fig. 3: Feature augmentation of the upper body. In the left image, we show 8 upper-body joint coordinates (red), vectors connecting these joints (black) and angles between these vectors (green). From all upper-body joints, we compute a line of best fit (blue). In the right image, we show all the vectors (purple) between unique pairs of upper-body joints. We also compute the angles (not shown) between these vectors and the line of best fit. The resulting 97 components of this augmented pose vector are mapped to the depth of the neck joint (orange circle, pointed by red arrow), obtained from the Kinect V2 depth image.

3.1.3 Skeleton Depth Estimation

Inspired by [29], which demonstrated that augmenting pose coordinates may improve the performance of gesture classifiers, we develop a 97-dimensional augmented pose vector $\mathbf{x}_n$ (subscript n means Neck here) from the 8 upper-body joint coordinates. From the joint coordinates, we obtain, via least squares, a line of best fit. In addition to 7 vectors from anatomically connected joints, 21 vectors between unique pairs of all upper-body coordinates are also obtained. The lengths of the individual augmented vectors are also included in $\mathbf{x}_n$. We also include the 6 angles formed by all triplets of anatomically connected joints, and the 28 angles between the 28 (anatomically connected plus augmented) vectors and the line of best fit. The resulting 97-dimensional augmented pose vector concatenates: 42 elements from the abscissas and ordinates of the augmented vectors, their 21 estimated lengths and 34 relevant angles.

To obtain the ground-truth depth of the Neck joint, denoted $d_n$, we utilize the OpenSign dataset, which contains Kinect V2 RGB+D images of persons. We apply our augmented pose extractor to all images in the dataset and, for each image, we associate $\mathbf{x}_n$ to the corresponding Neck depth. A 9-layer neural network $f_n$ is then designed, to optimize parameters $\theta_n$, given the augmented pose vector $\mathbf{x}_n$ and the ground-truth $d_n$, to regress the approximate distance value $\tilde{d}_n$. Formally:

$$\tilde{d}_n = f_n(\mathbf{x}_n;\, \theta_n) \qquad (4)$$

We use this value of $\tilde{d}_n$ for scale normalization (3).
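To make the component count of the augmented pose vector concrete (42 + 21 + 34 = 97), here is a hedged sketch that assembles such a vector. The joint ordering, the list of anatomical connections (`BONES`), the six joint triplets (`TRIPLETS`) and the use of an orthogonal least-squares line are assumptions chosen for illustration; the text does not specify them.

```python
import math
from itertools import combinations

# Assumed joint order: 0 nose, 1 neck, 2 r-shoulder, 3 r-elbow, 4 r-wrist,
# 5 l-shoulder, 6 l-elbow, 7 l-wrist. Both lists below are assumptions.
BONES = [(1, 0), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7)]
TRIPLETS = [(0, 1, 2), (0, 1, 5), (1, 2, 3), (2, 3, 4), (1, 5, 6), (5, 6, 7)]

def _angle(u, v):
    """Unsigned angle between two 2D vectors (0.0 if either has zero length)."""
    nu, nv = math.hypot(*u), math.hypot(*v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    c = (u[0] * v[0] + u[1] * v[1]) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, c)))

def best_fit_direction(joints):
    """Direction of the orthogonal least-squares line through the joints."""
    n = len(joints)
    mx = sum(x for x, _ in joints) / n
    my = sum(y for _, y in joints) / n
    sxx = sum((x - mx) ** 2 for x, _ in joints)
    syy = sum((y - my) ** 2 for _, y in joints)
    sxy = sum((x - mx) * (y - my) for x, y in joints)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))

def augmented_pose(joints):
    """Build a 97-element feature list from 8 (x, y) joints."""
    def vec(a, b):
        return (joints[b][0] - joints[a][0], joints[b][1] - joints[a][1])

    bone_set = {frozenset(b) for b in BONES}
    extra = [p for p in combinations(range(8), 2) if frozenset(p) not in bone_set]
    extra_vecs = [vec(a, b) for a, b in extra]      # 21 augmented vectors
    bone_vecs = [vec(a, b) for a, b in BONES]       # 7 anatomical vectors
    line = best_fit_direction(joints)

    feats = []
    for v in extra_vecs:                            # 42 abscissas and ordinates
        feats.extend(v)
    feats.extend(math.hypot(*v) for v in extra_vecs)                 # 21 lengths
    feats.extend(_angle(vec(b, a), vec(b, c)) for a, b, c in TRIPLETS)  # 6 angles
    feats.extend(_angle(v, line) for v in bone_vecs + extra_vecs)    # 28 angles
    return feats
```

Whatever the exact joint conventions, the bookkeeping is the point here: 21 augmented vectors contribute 42 coordinates and 21 lengths, and the 6 + 28 angles bring the total to 97.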

3.2 Focus on Hands Module

This module focuses on hands in two steps: first, by localizing them in the scene, and then by determining the size of their bounding boxes, in order to crop hand images.

3.2.1 Hand Localization

One way of localizing hands in an image is via detectors, possibly trained on hand images as in [30]. Yet, such strategies struggle to distinguish left and right hands, since they operate locally, thus lacking contextual information. To avoid this, we employ openpose for hand localization; specifically, the 42 (21 per hand) hand key-points that it detects on each image. We observed that these key-points are more susceptible to jitter and mis-detections than the skeleton joints, particularly on the low-resolution videos of the Chalearn 2016 dataset. Therefore, we apply the same filter of equations (1) and (2) to the raw hand key-points output by openpose. Then, for each hand, we estimate the mean of all $M$ detected hand key-point coordinates $\mathbf{h}_j$, to obtain:

$$\mathbf{c} = \frac{1}{M} \sum_{j=1}^{M} \mathbf{h}_j \qquad (5)$$

the hand center in the image.

3.2.2 Hand Bounding-box Estimation

Once the hands are located in the image, the surrounding image patches must be cropped for gesture recognition. Since at run-time our gesture recognition system relies only on RGB images (without depth), we develop two additional neural networks, $f_l$ and $f_r$, to estimate each hand’s bounding box size. These networks are analogous to the one described in Sect. 3.1.3. Following the scale-normalization approach, for each hand we build a 54-dimensional augmented pose vector from 6 key-points. These augmented pose vectors ($\mathbf{x}_l$ and $\mathbf{x}_r$) are mapped to the ground-truth hand depth values ($d_l$ and $d_r$) obtained from the OpenSign dataset, through two independent neural networks:

$$\tilde{d}_l = f_l(\mathbf{x}_l;\, \theta_l) \qquad (6)$$

$$\tilde{d}_r = f_r(\mathbf{x}_r;\, \theta_r) \qquad (7)$$

In (6) and (7), $f_l$ and $f_r$ are 9-layer neural networks that optimize parameters $\theta_l$ and $\theta_r$, given augmented poses $\mathbf{x}_l$ and $\mathbf{x}_r$ and ground-truth depths $d_l$ and $d_r$, to estimate depths $\tilde{d}_l$ and $\tilde{d}_r$. The size of each bounding box is inversely proportional to the corresponding depth ($\tilde{d}_l$ or $\tilde{d}_r$) obtained by applying (6) and (7) to the pure RGB images. The orientation of each bounding box is estimated from the inclination between the corresponding forearm and the horizontal. The final outputs are the cropped images of the left and right hands.
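The geometry of the Focus on Hands Module can be sketched as follows: the hand center is the mean of the detected key-points, the box side shrinks as the estimated depth grows, and the box angle follows the forearm. The function names, the `scale` constant and the dictionary layout are hypothetical; only the inverse proportionality and the forearm-based orientation come from the text.

```python
import math

def hand_center(keypoints):
    """Mean of the detected hand key-points; None entries are undetected points."""
    pts = [p for p in keypoints if p is not None]
    if not pts:
        return None
    return (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))

def hand_box(center, depth, elbow, wrist, scale=200.0):
    """Describe an oriented square crop around one hand.

    depth : estimated hand depth (must be > 0).
    scale : assumed proportionality constant, so that side = scale / depth.
    The angle is the inclination of the forearm (elbow to wrist) with respect
    to the horizontal image axis, in radians.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    side = scale / depth
    angle = math.atan2(wrist[1] - elbow[1], wrist[0] - elbow[0])
    return {"center": center, "side": side, "angle": angle}


# Example: a hand twice as far away gets a box half as large.
near = hand_box((1.0, 2.0), 2.0, (0.0, 0.0), (1.0, 0.0))
far = hand_box((1.0, 2.0), 4.0, (0.0, 0.0), (1.0, 0.0))
```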

4 Video Data Processing

Our proposed spatial attention module conceptually allows end-to-end training of the gestures. However, we train our network in multiple stages to speed up the training process (the details of which are given in Sect. 6). Yet, this requires the videos to be processed step-by-step beforehand. This is done in four steps, i.e., (1) 2D pose estimation, (2) feature extraction, (3) label-wise sorting and zero-padding and (4) train-ready data formulation. While prior 2D pose estimation may be considered a compulsory step, even if the network is trained in an end-to-end fashion, the other steps can be integrated into the training algorithm.

4.1 Dynamic Features: Joints Velocities and Accelerations

As described in Sect. 3, our features of interest for gesture recognition are the skeleton and the hand images. The concept of augmented pose for scale normalization has been detailed in Sect. 3.1.3. For dynamic gesture recognition, velocity and acceleration vectors from the 8 upper-body joints, which contain information about the dynamics of motion, are also appended to the pose vector to form a new 129-component augmented pose vector. Inspired by [29], joint velocities and accelerations are computed as first and second derivatives of the scale-normalized joint coordinates at each image.

The velocities and accelerations obtained from (8) and (9) are scaled by the video frame-rate to make the values time-consistent, before appending them to the augmented pose vector.
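A minimal sketch of the dynamic features is given below. It assumes central finite differences with clamped borders and a particular frame-rate scaling (`fps` for velocities, `fps ** 2` for accelerations); these are illustrative choices, not necessarily the exact scheme of equations (8) and (9). With 8 joints, the 16 velocity and 16 acceleration components added to the 97-dimensional pose vector give 97 + 16 + 16 = 129 components.

```python
def joint_dynamics(frames, fps):
    """Per-frame joint velocities and accelerations by finite differences.

    frames : list of frames, each a list of (x, y) scale-normalized joints.
    fps    : video frame rate, used to express the values per second.
    Borders are handled by clamping the neighbor index (an assumption).
    """
    n = len(frames)
    velocities, accelerations = [], []
    for k in range(n):
        prev = frames[max(k - 1, 0)]
        cur = frames[k]
        nxt = frames[min(k + 1, n - 1)]
        v, a = [], []
        for (xp, yp), (xc, yc), (xn, yn) in zip(prev, cur, nxt):
            # Central difference for the first derivative.
            v.append(((xn - xp) * fps / 2.0, (yn - yp) * fps / 2.0))
            # Second-order difference for the second derivative.
            a.append(((xn - 2.0 * xc + xp) * fps ** 2,
                      (yn - 2.0 * yc + yp) * fps ** 2))
        velocities.append(v)
        accelerations.append(a)
    return velocities, accelerations
```

Scaling by the frame rate makes sequences recorded at different rates comparable, which is the time-consistency the text refers to.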

For every frame output by the skeleton filter of Section 3.1.1, scale-normalized augmented pose vectors (as explained in 3.1.2) plus left and right hands cropped images (extracted as explained in Sect. 3.2) are appended in three individual arrays.

4.2 Train-Ready Data Formulation

The videos in Chalearn 2016 are randomly distributed. Once the features of interest (the augmented pose vectors and the left and right hand images) are extracted and saved in .h5 files, we sort them with respect to their labels. It is natural to expect the dataset videos (previously sequences of images, now arrays of features) to be of different lengths. The average video length in this dataset is 32 frames, while we fix the length of each sequence to 40 images in our work. If the length of a sequence is less than 40, we pad zeros symmetrically at the start and end of the sequence. Alternatively, if the length is greater than 40, we perform symmetric trimming of the sequence. Once the lengths of the sequences are rectified (padded or trimmed), we append all corresponding sequences of a gesture label into a single array. At the end of this procedure, we are left with the 249 gestures in the Chalearn 2016 dataset, along with an array of ground-truth labels. Each feature of the combined augmented pose vectors is normalized to zero mean and unit variance, while for hand images we perform pixel-wise division by the maximum intensity value (e.g., 255). The label-wise sorting presented in this section is only necessary if one wants to train a network on selected gestures (as we will explain in Sect. 6). Otherwise, creating only a ground-truth label array should suffice.
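The length rectification and feature standardization described above can be sketched as follows. The helper names are hypothetical, and the choice to put the extra zero frame at the end when the padding is odd is an assumption, since the text only says the padding is symmetric.

```python
import math

def fix_length(seq, target=40):
    """Symmetrically zero-pad or trim a sequence of feature vectors to `target`."""
    n = len(seq)
    if n >= target:
        start = (n - target) // 2          # drop equally from both ends
        return seq[start:start + target]
    dim = len(seq[0]) if seq else 0
    before = (target - n) // 2
    after = target - n - before            # any odd remainder goes to the end
    pad = lambda count: [[0.0] * dim for _ in range(count)]
    return pad(before) + list(seq) + pad(after)

def standardize(rows):
    """Normalize every feature column to zero mean and unit variance."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
            for j in range(dim)]
    # Constant columns are mapped to 0.0 to avoid dividing by zero.
    return [[(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
             for j in range(dim)] for r in rows]
```

In a real pipeline the means and standard deviations would be computed on the training set and reused for validation and test data.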

5 CNN-LSTM for Dynamic Gesture Recognition

To model spatio-temporal dependencies for the classification of dynamic gestures, our static gestures detector is joined through its last Fully Connected (FC) layer with 3 LSTM networks stacked in a CNN-LSTM architecture as part of the proposed SaDNet (see Fig. 4). As explained in Sect. 3, our spatial attention module extracts the augmented pose and the hands of the user. Image embeddings of 1024 elements for each hand are obtained from the last FC layer of our static hand gestures detector. Multiple modalities, i.e., the 129-component standardized augmented pose and the image embeddings of 1024 elements for each hand, are fused in intermediate layers of the proposed network, which functions as a many-to-one classifier. The output of the last LSTM block is then sent to an FC dense layer followed by a softmax layer to provide the gesture class label probabilities.

Fig. 4: Illustration of the proposed CNN-LSTM network for static and dynamic gestures recognition. The outputs of the last FC layers of our CNNs and the augmented pose vector are fused in the intermediate layers.

A dropout strategy is employed between successive layers to prevent over-fitting. Moreover, we exploit batch-normalization to accelerate training.

6 Training

The proposed network is trained on a computer with Intel© Core i7-6800K (3.4 GHz) CPU, dual Nvidia GeForce GTX 1080 GPUs and 64 GB system memory. We focus on the two datasets detailed below.

The Chalearn 2016 dataset has 35,876 videos in the provided training set, with only the top 47 gestures (arranged in descending order of samples) representing 34% of all videos. The numbers of videos in the provided validation and test sets are 5784 and 6271 respectively. Thus, we utilize 12210 videos of 47 gestures to pre-train our CNN-LSTM, keeping part of them as a validation split. The learned parameters are then exploited to initialize the weights for model training, to classify all 249 gestures. We exploit the Adam optimizer to train our CNN-LSTM.

Praxis Cognitive Assessment Dataset is designed to diagnose apraxia and contains Kinect V2 RGB and depth images recorded by 60 subjects and 4 clinicians. It has 1247 videos for 14 correctly performed dynamic gestures. Given the small size of this dataset, we adapt the network hyper-parameters to avoid over-fitting.

7 Results

For the Chalearn 2016 dataset, the proposed network is initially trained on 47 gestures with a low learning rate of . After approximately 66,000 iterations, a validation accuracy of 95.45% is obtained. The parameters learned for 47 gestures are employed to initialize the weights for training on the complete data with 249 gestures, as previously described. The network is trained in four phases. Weight initialization, inspired by the transfer learning concept of deep networks, is performed by replacing the classification layer (with softmax activation function) with one of the same type whose number of output neurons corresponds to the number of class labels in the dataset. In our case, we replace the softmax layer of the network trained for 47 gestures, plus the FC layer immediately preceding it. The proposed model is trained for 249 gesture classes with a learning rate of and a decay value of by Adam optimizer.
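The head replacement described above can be sketched with a plain dictionary standing in for the model. This is only an illustration: the layer names, the `HIDDEN` width and the `replace_head` helper are hypothetical, and a real implementation would copy weight tensors inside a deep learning framework.

```python
HIDDEN = 256  # hypothetical width of the layers feeding the classifier


def replace_head(pretrained, new_head):
    """Keep every pre-trained layer except those that the new head overrides."""
    layers = {name: spec for name, spec in pretrained.items() if name not in new_head}
    layers.update(new_head)  # freshly initialized FC + softmax layers
    return layers


# Model pre-trained on the 47 most populated gestures.
pretrained_47 = {
    "cnn": {"units": 1024, "init": "pretrained"},
    "lstm_stack": {"units": HIDDEN, "init": "pretrained"},
    "fc_head": {"units": HIDDEN, "init": "pretrained"},
    "softmax": {"units": 47, "init": "pretrained"},
}

# New head sized for all 249 gesture classes.
model_249 = replace_head(
    pretrained_47,
    {
        "fc_head": {"units": HIDDEN, "init": "random"},
        "softmax": {"units": 249, "init": "random"},
    },
)
```

Only the two replaced layers start from random values, so the features learned on the well-populated classes are preserved when training on all classes begins.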

With the weights initialized, the early iterations are performed with all layers of the network locked except the newly added FC and softmax layers. As the number of epochs increases, we successively unlock the network layers from the bottom (deep layers). In the second phase, network layers until the last LSTM block are unlocked. All LSTM blocks and then the complete model are unlocked, respectively in the third and fourth phase.
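The four-phase unlocking schedule can be written down as a small lookup. The sketch below is one reading of the description, under the assumption that the layer groups are ordered from input to output and that each phase unlocks everything from a given group up to the output. The group names are hypothetical and do not come from the source.

```python
# Layer groups ordered from input to output (names are hypothetical).
LAYER_GROUPS = [
    "cnn",
    "time_distributed_fc",
    "lstm_1",
    "lstm_2",
    "lstm_3",
    "fc_head",
    "softmax",
]

# First unlocked group in each training phase (assumed interpretation):
# 1: new FC + softmax only, 2: from the last LSTM block,
# 3: from the first LSTM block, 4: the complete model.
UNLOCKED_FROM = {1: "fc_head", 2: "lstm_3", 3: "lstm_1", 4: "cnn"}


def trainable_groups(phase):
    """Return the layer groups whose weights are updated in the given phase."""
    start = LAYER_GROUPS.index(UNLOCKED_FROM[phase])
    return LAYER_GROUPS[start:]


schedule = {phase: trainable_groups(phase) for phase in (1, 2, 3, 4)}
```

In a framework such as Keras, this would typically translate into setting the `trainable` flag of each layer and recompiling the model at every phase boundary.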

Fig. 5: Training curves of the proposed CNN-LSTM network for all 249 gestures of Chalearn 2016. The network is trained in four phases, distinguished by the vertical lines.

By approximately 2700 epochs, our CNN-LSTM achieves 86.69% validation accuracy for all 249 gestures and 86.75% test accuracy, surpassing the state-of-the-art methods on this dataset. The prediction time for each video sample is 57.17 ms excluding pre-processing of the video frames, thus continuous online dynamic gesture recognition can be achieved in real-time. The training curve of the complete model is shown in Fig. 5, while the confusion matrix/heat-map with evaluations on the test set is shown in Figure 6. Our results on the Chalearn 2016 dataset are compared with the reported state-of-the-art in Table I.

Method                     Valid (%)   Test (%)
SaDNet (ours)              86.69       86.75
FOANet [25]                80.96       82.07
Miao et al. [23] (ASU)     64.40       67.71
SYSU_IEEE                  59.70       67.02
Lostoy                     62.02       65.97
Wang et al. [43] (AMRL)    60.81       65.59

TABLE I: Comparison of the reported results with ours on Chalearn 2016. The challenge results are published in [37].

System                 Accuracy on dynamic gestures (%)
SaDNet (ours)          99.60
Negin et al. [27]      76.61

TABLE II: Comparison of dynamic gestures recognition results on Praxis gestures dataset; [27] also used a CNN-LSTM network.

Inspecting the training curves, we observe that the network progresses towards slight over-fitting in the fourth phase, when all network layers are unlocked. Specifically, we suspect the first time-distributed FC layer to be the cause of this phenomenon. Although a dropout layer is already placed immediately after this layer, with dropout rate equaling , we did not investigate further to rectify this. However, we assume that substituting this layer with pose-driven temporal attention [2] or with an adaptive hidden layer [11] may help reduce this undesirable phenomenon and ultimately further improve the results.

For the Praxis dataset, the optimizer and the values of learning rate and decay are the same as for the Chalearn 2016 dataset. The hyper-parameters, including the number of neurons in the FC layers and the sizes of the hidden and cell states of the LSTM blocks, are reduced to avoid over-fitting. Our model obtains 99.6% test accuracy on 501 samples.
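The overall and per-class figures reported with the confusion matrices can be computed as in the following minimal sketch. The toy matrix is invented for illustration, and rows are assumed to index the true classes. The last line checks one count that is consistent with the reported accuracy (499 correct out of 501); this count is an inference, as the source does not state the number of misclassified videos.

```python
def overall_accuracy(confusion):
    """Sum of the diagonal divided by the total number of samples."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def per_class_recall(confusion):
    """Diagonal entry divided by its row sum (rows = true classes, by assumption)."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


# Toy 3-class confusion matrix with 50 samples per class.
toy = [
    [48, 2, 0],
    [1, 49, 0],
    [0, 0, 50],
]
toy_accuracy = overall_accuracy(toy)   # 147 correct out of 150
toy_recall = per_class_recall(toy)

# A count consistent with the reported figure on 501 samples (inferred).
implied_accuracy = round(100 * 499 / 501, 1)
```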

Fig. 6: Illustration of the confusion matrix/heat-map of the proposed CNN-LSTM model evaluated on test set of Chalearn 2016 isolated gestures recognition dataset. It is evident that most samples in the test set are recognized with high accuracy for all 249 gestures (diagonal entries, 86.75 overall).
Fig. 7: Illustration of the confusion matrix of the proposed CNN-LSTM model evaluated on test set of Praxis dataset. The diagonal values represent video samples correctly predicted for each class and the percentage represents their share (in terms of numbers) in the test set. sum_cols and sum_rows represent sum along the columns and rows of test accuracy for each class. Bottom right entry shows overall accuracy of the test set (containing 501 video samples).
Fig. 8: Snapshots of our gesture-controlled safe human-robot interaction experiment detailed in [21]. The human operator manually guides the robot to waypoints in the workspace, then asks the robot to record them through a gesture. The human operator can transmit other commands to the robot, such as replay, stop, resume and reteach, using only hand gestures.

The comparison of our results on the Praxis dataset with the state-of-the-art is shown in Table II. We also quantify the performance of our static hand gesture detector on a test set of 4190 hand images. The overall test accuracy is found to be 98.9%. The normalized confusion matrix for 10 static hand gestures is shown in Figure 9.

Fig. 9: Normalized confusion matrix for our static hand gesture detector quantified on test-set of OpenSign. For more details please see [21].

We devised robotic experiments for gesture-controlled safe human-robot interaction tasks in [21]. These are preliminary experiments that allow the human operator to communicate with the robot through static hand gestures in real-time, while the integration of dynamic gestures is yet to be done. The experiments were performed on the BAZAR robot [5], which has two Kuka LWR 4+ arms with two Shadow Dexterous Hands attached at the end-effectors.

We exploited OpenPHRI [26], which is an open-source library, to control the robot while corroborating safety of the human operator. A finite state machine is developed to control the behavior of the robot, which is determined by the sensory information, e.g., hand gestures, distance of the human operator from the robot, joint-torque sensing, etc. The experiment is decomposed into two phases: 1) a teaching by demonstration phase, where the user manually guides the robot to a set of waypoints and 2) a replay phase, where the robot autonomously goes to every recorded waypoint to perform a given task, here force control. A video of the experiment is available online (http://youtu.be/lB5vXc8LMnk) and snapshots are given in Figure 8.
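A gesture-driven finite state machine of the kind described above could look like the following minimal sketch. It is not the state machine used in the experiment and it does not use the OpenPHRI API: the states, the transition table, the `GestureStateMachine` class and the 0.5 m safety distance are all hypothetical. Only the command names (record, replay, stop, resume, reteach) follow the caption of Fig. 8.

```python
# Hypothetical transition table: (current state, gesture command) -> next state.
TRANSITIONS = {
    ("teaching", "record"): "teaching",    # store a waypoint, keep teaching
    ("teaching", "replay"): "replaying",
    ("replaying", "stop"): "stopped",
    ("replaying", "reteach"): "teaching",
    ("stopped", "resume"): "replaying",
    ("stopped", "reteach"): "teaching",
}


class GestureStateMachine:
    def __init__(self, safety_distance_m=0.5):
        self.state = "teaching"
        self.waypoints = []
        self.safety_distance_m = safety_distance_m  # illustrative threshold

    def step(self, gesture, robot_pose=None, human_distance_m=float("inf")):
        # Assumed safety rule: a human too close during replay forces a stop.
        if self.state == "replaying" and human_distance_m < self.safety_distance_m:
            self.state = "stopped"
            return self.state
        next_state = TRANSITIONS.get((self.state, gesture))
        if next_state is None:
            return self.state  # unknown or irrelevant gesture: ignore it
        if gesture == "record":
            self.waypoints.append(robot_pose)
        elif gesture == "reteach":
            self.waypoints.clear()
        self.state = next_state
        return self.state


fsm = GestureStateMachine()
fsm.step("record", robot_pose=(0.4, 0.0, 0.3))
fsm.step("record", robot_pose=(0.4, 0.2, 0.3))
fsm.step("replay")
recorded = len(fsm.waypoints)
state_after_intrusion = fsm.step(None, human_distance_m=0.2)
state_after_resume = fsm.step("resume")
```

Keeping the transitions in a table makes it straightforward to add further commands or sensory conditions, such as joint-torque thresholds, without changing the control flow.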

8 Conclusion

We proposed a unified framework for simultaneous recognition of static hand and dynamic upper-body gestures. We also presented the idea of learning-based depth estimators, which predict the distance of the person and his/her hands, exploiting only the upper-body 2D skeleton coordinates. With this feature, monocular images are sufficient and the framework does not require depth sensing. Thus, our framework can be integrated into any cyber-physical system and Internet of Things to intuitively control smart devices.

Our pose-driven spatial attention mechanism, which focuses on upper-body pose for large-scale body movements of the limbs plus on hand images for subtle hand/fingers movements, enabled SaDNet to out-score the existing approaches on the datasets employed.

The presented weight initialization strategy facilitated parameter optimization for all 249 gestures, although the number of samples varied substantially among the classes in the dataset. Our static gestures detector outputs the label frame-wise in real-time at approximately 21 fps with state-of-the-art recognition accuracy. However, class recognition for dynamic gestures is performed on isolated gesture videos executed by a single individual in the scene. We plan to extend this work to continuous dynamic gestures recognition to demonstrate its utility in human-machine interaction. One way to achieve this is to develop a binary motion detector that detects the start and end instances of the gestures. Although a multi-stage training strategy is presented, we envision an end-to-end training approach for online learning of new gestures.
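The binary motion detector is only mentioned as future work, so the sketch below is purely illustrative of one simple possibility: thresholding the frame-to-frame displacement of the pose coordinates. The functions `motion_energy` and `detect_segments`, the threshold of 0.05 and the minimum segment length are hypothetical choices, not part of the proposed framework.

```python
def motion_energy(prev_pose, pose):
    """Mean absolute displacement of the pose coordinates between two frames."""
    return sum(abs(a - b) for a, b in zip(prev_pose, pose)) / len(pose)


def detect_segments(poses, threshold=0.05, min_frames=2):
    """Return (start, end) frame indices of runs where motion exceeds the threshold."""
    # The first frame has no predecessor, so it is marked as not moving.
    moving = [False] + [
        motion_energy(poses[i - 1], poses[i]) > threshold
        for i in range(1, len(poses))
    ]
    segments, start = [], None
    for i, flag in enumerate(moving):
        if flag and start is None:
            start = i                      # motion begins
        elif not flag and start is not None:
            if i - start >= min_frames:
                segments.append((start, i - 1))
            start = None                   # motion ends
    if start is not None and len(moving) - start >= min_frames:
        segments.append((start, len(moving) - 1))
    return segments


# Toy 1-D "pose": still, moving for three frames, then still again.
toy_poses = [[0.0], [0.0], [0.2], [0.4], [0.6], [0.6], [0.6]]
segments = detect_segments(toy_poses)
```

Each detected segment could then be passed as an isolated clip to the dynamic gesture classifier.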


References

  • [1] M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt (2011) Sequential Deep Learning for Human Action Recognition. In Int. Workshop on Human Behavior Understanding, pp. 29–39. Cited by: §2.1.
  • [2] F. Baradel, C. Wolf, and J. Mille (2017) Human Action Recognition: Pose-based Attention Draws Focus to Hands. In Proc. of the IEEE Int. Conf. on Computer Vision, pp. 604–613. Cited by: §2.6, §2.7, §7.
  • [3] F. Baradel, C. Wolf, and J. Mille (2017) Pose-conditioned Spatio-temporal Attention for Human Action Recognition. arXiv preprint arXiv:1703.10106. Cited by: §2.6, §2.7.
  • [4] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In IEEE Conf. on Computer Vision and Pattern Recognition. Cited by: §3.1.
  • [5] A. Cherubini, R. Passama, B. Navarro, M. Sorour, A. Khelloufi, O. Mazhar, S. Tarbouriech, J. Zhu, O. Tempier, A. Crosnier, et al. (2019) A collaborative robot for the factory of the future: bazar. The International Journal of Advanced Manufacturing Technology 105 (9), pp. 3643–3659. Cited by: §7.
  • [6] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie (2005) Behavior Recognition via Sparse Spatio-Temporal Features. In 2005 IEEE Int. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72. Cited by: §2.
  • [7] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell (2015) Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 2625–2634. Cited by: §1, §2.4, §2.7.
  • [8] M. A. Goodale and A. D. Milner (1992) Separate Visual Pathways for Perception and Action. Trends in Neurosciences 15 (1), pp. 20–25. Cited by: §2.3.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §2.
  • [10] S. Hochreiter and J. Schmidhuber (1997) Long Short-Term Memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §2.1.
  • [11] T. Hu, Y. Lin, and P. Hsiu (2018) Learning Adaptive Hidden Layers for Mobile Gesture Recognition. In Thirty-Second AAAI Conf. on Artificial Intelligence. Cited by: §7.
  • [12] H. Idrees, A. R. Zamir, Y. Jiang, A. Gorban, I. Laptev, R. Sukthankar, and M. Shah (2017) The THUMOS Challenge on Action Recognition for Videos “In the Wild”. Computer Vision and Image Understanding 155, pp. 1–23. Cited by: §2.5.
  • [13] S. Ji, W. Xu, M. Yang, and K. Yu (2012) 3D Convolutional Neural Networks for Human Action Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence 35 (1), pp. 221–231. Cited by: §2.1.
  • [14] P. Jung, G. Lim, S. Kim, and K. Kong (2015) A wearable gesture recognition device for detecting muscular activities based on air-pressure sensors. IEEE Trans. on Industrial Informatics 11 (2), pp. 485–494. Cited by: §1.
  • [15] V. Kantorov and I. Laptev (2014) Efficient Feature Extraction, Encoding and Classification for Action Recognition. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 2593–2600. Cited by: §2.
  • [16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. F. Li (2014) Large-scale Video Classification with Convolutional Neural Networks. Proc. of the IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §2.1, §2.2, §2.3, §2.4.
  • [17] J. Kofman, X. Wu, T. J. Luu, and S. Verma (2005) Teleoperation of a robot manipulator using a vision-based human-robot interface. IEEE Trans. on Industrial Electronics 52 (5), pp. 1206–1219. Cited by: §1.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §2.5.
  • [19] I. Laptev (2005) On Space-time Interest Points. Int. Journal of Computer Vision 64 (2-3), pp. 107–123. Cited by: §2.
  • [20] G. Li, H. Wu, G. Jiang, S. Xu, and H. Liu (2018) Dynamic gesture recognition in the internet of things. IEEE Access 7, pp. 23713–23724. Cited by: §1.
  • [21] O. Mazhar, B. Navarro, S. Ramdani, R. Passama, and A. Cherubini (2019) A Real-time Human-Robot Interaction Framework with Robust Background Invariant Hand Gesture Detection. Robotics and Computer-Integrated Manufacturing 60, pp. 34–48. Cited by: §1, Fig. 8, Fig. 9, §7.
  • [22] O. Mazhar (2019) OpenSign - Kinect V2 Hand Gesture Data - American Sign Language. Cited by: §1.
  • [23] Q. Miao, Y. Li, W. Ouyang, Z. Ma, X. Xu, W. Shi, and X. Cao (2017) Multimodal Gesture Recognition based on the ResC3D Network. In Proc. of the IEEE Int. Conf. on Computer Vision, pp. 3047–3055. Cited by: §2.2, §2.3, TABLE I.
  • [24] V. Mnih, N. Heess, A. Graves, et al. (2014) Recurrent Models of Visual Attention. In Advances in Neural Information Processing Systems, pp. 2204–2212. Cited by: §2.6, §2.6.
  • [25] P. Narayana, R. Beveridge, and B. A. Draper (2018) Gesture Recognition: Focus on the Hands. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 5235–5244. Cited by: §2.6, §2.7, TABLE I.
  • [26] B. Navarro, A. Fonte, P. Fraisse, G. Poisson, and A. Cherubini (2018) In pursuit of safety: an open-source library for physical human-robot interaction. IEEE Robotics & Automation Magazine 25 (2), pp. 39–50. Cited by: §7.
  • [27] F. Negin, P. Rodriguez, M. Koperski, A. Kerboua, J. Gonzàlez, J. Bourgeois, E. Chapoulie, P. Robert, and F. Bremond (2018) PRAXIS: Towards Automatic Cognitive Assessment Using Gesture Recognition. Expert Systems with Applications. Cited by: TABLE II.
  • [28] N. Neverova, C. Wolf, G. Paci, G. Sommavilla, G. Taylor, and F. Nebout (2013) A Multi-scale Approach to Gesture Detection and Recognition. In Proc. of the IEEE Int. Conf. on Computer Vision Workshops, pp. 484–491. Cited by: §2.2.
  • [29] N. Neverova, C. Wolf, G. W. Taylor, and F. Nebout (2014) Multi-scale Deep Learning for Gesture Detection and Localization. In European Conf. on Computer Vision, pp. 474–490. Cited by: §1, §2.2, §2.7, §3.1.3, §4.1.
  • [30] P. Panteleris, I. Oikonomidis, and A. Argyros (2018) Using a Single RGB Frame for Real time 3D Hand Pose Estimation in the Wild. In 2018 IEEE Winter Conf. on Applications of Computer Vision (WACV), pp. 436–445. Cited by: §3.2.1.
  • [31] H. S. Park, D. J. Jung, and H. J. Kim (2006) Vision-based Game Interface using Human Gesture. In Pacific-Rim Symposium on Image and Video Technology, pp. 662–671. Cited by: §1.
  • [32] R. A. Rensink (2000) The Dynamic Representation of Scenes. Visual Cognition 7 (1-3), pp. 17–42. Cited by: §1, §2.6.
  • [33] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek (2013) Image classification with the Fisher Vector: Theory and Practice. Int. Journal of Computer Vision 105 (3), pp. 222–245. Cited by: §2.
  • [34] K. Simonyan and A. Zisserman (2014) Two-stream Convolutional Networks for Action Recognition in Videos. In Advances in Neural Information Processing Systems, pp. 568–576. Cited by: §2.3, §2.3.
  • [35] K. Simonyan and A. Zisserman (2014) Very Deep Convolutional Networks for Large-scale Image Recognition. arXiv preprint arXiv:1409.1556. Cited by: §2.
  • [36] D. Tran, J. Ray, Z. Shou, S. Chang, and M. Paluri (2017) ConvNet Architecture Search for Spatiotemporal Feature Learning. arXiv preprint arXiv:1708.05038. Cited by: §2.2.
  • [37] J. Wan, S. Escalera, G. Anbarjafari, H. Jair Escalante, X. Baró, I. Guyon, M. Madadi, J. Allik, J. Gorbova, C. Lin, et al. (2017) Results and Analysis of Chalearn LAP Multi-modal Isolated and Continuous Gesture Recognition, and Real versus Fake Expressed Emotions Challenges. In Proc. of the IEEE Int. Conf. on Computer Vision, pp. 3189–3197. Cited by: TABLE I.
  • [38] J. Wan, Y. Zhao, S. Zhou, I. Guyon, S. Escalera, and S. Z. Li (2016) Chalearn Looking at People RGB-D Isolated and Continuous Datasets for Gesture Recognition. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition Workshops, pp. 56–64. Cited by: §1, §2.2.
  • [39] H. Wang, A. Kläser, C. Schmid, and C. Liu (2011) Action Recognition by Dense Trajectories. In IEEE Conf. on Computer Vision and Pattern Recognition, pp. 3169–3176. Cited by: §2.
  • [40] H. Wang and C. Schmid (2013) Action Recognition with Improved Trajectories. In 2013 IEEE Int. Conf. on Computer Vision, pp. 3551–3558. Cited by: §2.
  • [41] H. Wang, D. Oneata, J. Verbeek, and C. Schmid (2016) A Robust and Efficient Video Representation for Action Recognition. Int. Journal of Computer Vision 119 (3), pp. 219–238. Cited by: §2.
  • [42] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid (2009) Evaluation of local spatio-temporal features for action recognition. Proc. of the British Machine Vision Conf. 2009, pp. 124.1–124.11. Cited by: §2.
  • [43] H. Wang, P. Wang, Z. Song, and W. Li (2017) Large-scale Multimodal Gesture Recognition using Heterogeneous Networks. In Proc. of the IEEE Int. Conf. on Computer Vision, pp. 3129–3137. Cited by: TABLE I.
  • [44] L. Wang, Y. Qiao, and X. Tang (2015) Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 4305–4314. Cited by: §2.3.
  • [45] S. Xingjian, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional LSTM network: A Machine Learning Approach for Precipitation Nowcasting. In Advances in Neural Information Processing Systems, pp. 802–810. Cited by: §2.4.
  • [46] S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori, and L. Fei-Fei (2018) Every Moment Counts: Dense Detailed Labeling of Actions in Complex Videos. Int. Journal of Computer Vision 126 (2-4), pp. 375–389. Cited by: §2.5, §2.7.
  • [47] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici (2015) Beyond Short Snippets: Deep Networks for Video Classification. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 4694–4702. Cited by: §2.4, §2.7.
  • [48] J. Zheng, X. Cao, B. Zhang, X. Zhen, and X. Su (2018) Deep ensemble machine for video classification. IEEE Trans. on Neural Networks and Learning Systems 30 (2), pp. 553–565. Cited by: §2.1.
  • [49] Z. Zheng, G. An, D. Wu, and Q. Ruan (2020) Global and local knowledge-aware attention network for action recognition. IEEE Trans. on Neural Networks and Learning Systems. Cited by: §2.6.
  • [50] G. Zhu, L. Zhang, L. Yang, L. Mei, S. A. A. Shah, M. Bennamoun, and P. Shen (2019) Redundancy and Attention in Convolutional LSTM for Gesture Recognition. IEEE Trans. on Neural Networks and Learning Systems. Cited by: §2.4.