
Dynamic Gesture Recognition by Using CNNs and Star RGB: a Temporal Information Condensation

With the advance of technologies, machines are increasingly present in people's daily lives. Thus, there has been more and more effort to develop interfaces, such as dynamic gestures, that provide an intuitive way of interaction. Currently, the most common trend is to use multimodal data, such as depth and skeleton information, to recognize dynamic gestures. However, using only color information would be more interesting, since RGB cameras are found in almost every public place and could be used for gesture recognition without the need to install other equipment. The main problem with this approach is the difficulty of representing spatio-temporal information using just color. With this in mind, we propose a technique that we call Star RGB, capable of describing a video clip containing a dynamic gesture as an RGB image. This image is then passed to a classifier formed by two Resnet CNNs, a soft-attention ensemble, and a multilayer perceptron, which returns the predicted class label indicating to which type of gesture the input video belongs. Experiments were carried out using the Montalbano and GRIT datasets. On the Montalbano dataset, the proposed approach achieved an accuracy of 94.58% considering only color information. On the GRIT dataset, our proposal achieves more than 98 reference approach in more than 6



1 Introduction

In the Human-Machine Interaction (HMI) research field, several communication interfaces can be used, such as manual controls, brain-reading devices, speech and gestures, among others. However, gestures and speech are the least intrusive and most natural ones. Between speech and gestures, there are many cases where gestures are preferred over speech because they do not have a mandatory grammar component in their elaboration (mcneill1992hand, ).

Gestures can be classified as static, i.e., those that do not change their form within a period of time, or dynamic, i.e., those formed by a set of static gestures (poses) that vary within a time interval. According to (mcneill1992hand, ), natural gestures, the most intuitive ones, are mostly dynamic and performed with the hands. Therefore, when attempting to use gestures effectively in HMI, the static ones play a less significant role. Due to the importance of gestures for HMI, for decades several studies have concentrated efforts on increasingly efficient recognizers. A list of the most used recognizers can be seen in (mitra2007survey, ; rautaray2015vision, ; dynamic2016survey, ), where (dynamic2016survey, ) shows only those intended to recognize dynamic gestures.

Some of the works mentioned above use more than one type of information to achieve better results, and are therefore called multimodal methods (neverova2016moddrop, ). They usually apply color (RGB format), depth measurements and skeleton joints to detect and recognize gestures. The additional sources, such as depth and skeleton joints, complement the color information, which facilitates the separation of classes by the recognizer (multimodal2017challenges, ). However, to acquire depth information in interactive environments, as well as other information besides RGB, it is usually necessary to have specific sensors, for example Microsoft Kinect, Asus Xtion Pro or Intel Realsense.

Such dependence on specific sensors restricts the interactional environment. Therefore, environments that already have RGB cameras, such as the surveillance cameras easily found in many private or public spaces, cannot be used by such recognizers. This is one of the reasons that motivate studies on recognition methods based only on color information.

Some works use color as the only source of information, but just a few of them obtain significant results, such as (barros2014real, ; multimodal2017challenges, ). Even so, the good results mentioned in those works are achieved using their particular gesture vocabularies, where the gestures are significantly different from each other, which simplifies the gesture recognition task. Thus, to improve HMI, we believe that it is necessary to develop new techniques capable of recognizing dynamic gestures based only on color information, and that can deal with gestures not easily distinguishable from one another.

Recently, Deep Learning (DL) has achieved state-of-the-art results in various problems within the areas of image processing and computer vision (liu2017survey, ; guo2016deep, ). Among the architectures used in DL, the Convolutional Neural Network (CNN) is currently the one with the best results in the computer vision field. This type of network has the property of sharing weights, which allows the relationship between weights and input data to be understood as convolution operations. In this way, CNNs have several filters which can be trained to extract specific features from the data.

As mentioned previously, recognizing dynamic gestures is not a simple task due to their temporal nature. Thus, even using DL, sophisticated techniques are needed to capture temporal information. For instance, the action recognition task often relies on complex DL architectures such as 3D convolution (3dConvolution_for_HAR, ; 3d_har2018, ), two-stream networks (zisserman2014, ; zisserman2016, ) or two-stream networks with 3D convolution (zisserman2017, ). These kinds of approaches require more processing power, which is an important restriction on their usage in environments where real-time response is a priority. Therefore, to tackle this problem, (barros2014real, ) used a technique (called here the Star representation) capable of condensing the temporal information of gestures in a single grayscale image. For this, they used the history of the movement, which is captured through the sum of the absolute differences between consecutive frames. With this approach, the problem of gesture recognition in videos can be seen as an image classification problem. Hence, using transfer learning, it is possible to recognize dynamic gestures utilizing CNNs pretrained for another image recognition task. Even though this approach can be interesting for gesture recognition, it has two main drawbacks: (i) gestures that can be distinguished only by their temporal sequence are particularly challenging to recognize, since the temporal information is reduced entirely to a spatial representation; (ii) the process results in just one grayscale image, which may hamper the use of transfer learning, since pretrained CNNs often receive as input an image with three channels.

To address these problems, we propose a dynamic gesture recognizer based only on color information. The contributions of this work are: (i) a new Star representation for the gestures calculated from each input video, which is an RGB image and can encode more temporal information; (ii) a dynamic gesture classifier based on a Deep Learning architecture composed of an ensemble of retrained CNNs fused by a soft-attention mechanism.

To detail and explain our proposal, this paper is structured as follows: Section 2 brings the related works and a brief explanation about the original star technique used to represent temporal information of gestures; Section 3 describes our proposal; Section 4 presents and discusses our experiments and results; and finally, Section 6 brings our conclusions and future works.

2 Related Works

Dynamic gesture recognition is a research field that has attracted increasing interest in recent years. Several previous works have focused on developing a more intuitive interface to interact with machines and other devices. However, because a gesture can be represented in different ways within the same context, recognizing a dynamic gesture is a problem with a high level of difficulty. Besides that, the way gestures are performed depends not just on the sequence of body motions but also on the cultural background of the people that employ the gestures. Consequently, there is still a lot of work to be done to achieve an interface capable of providing effective communication between humans and machines based on gestures.

For that reason, well-structured datasets that represent various gesture meanings are of paramount importance for the development of this research area. Competitions like Chalearn: Looking at People (LAP) posed a challenge for gesture recognition by releasing the Montalbano datasets V1 (escalera2013multi, ) and V2 (escalera2014, ), which fulfill some of the aforementioned requirements. These datasets comprise approximately 14000 gestures taken from 27 different subjects for 20 distinct classes. The sensor used to capture the data was a Microsoft Kinect 360. Thus, the data has a multimodal nature, since it consists of RGB, depth, user mask, 3D skeleton joints, and audio. The main difference between the V1 and V2 versions, besides a better annotation in the second version, is the lack of audio information, which is present only in the first one.

Even though they are not datasets focused on HMI, due to their importance to the dynamic gesture recognition task, many works test their approaches on them, such as (neverova2016moddrop, ; Efthimiou2016, ; LiChuankun2017, ; Joshi2017, ; Wang2017, ). Such works can be separated into two main groups: first, the works that aim to solve the original problem of the challenge (recognizing a sequence of dynamic gestures from a video) and, second, the works that classify the segmented gestures by using the labels provided by the datasets. Both groups may recognize the gestures by using either multimodal data or only one sort of information.

Among the first group, which tried to solve the original challenge problem, most approaches have used multimodal information. For instance, (neverova2016moddrop, ) has used all information available on both datasets. They applied all the different information to a set of CNNs. As a result, the authors obtained an accuracy of when they used all the data plus the audio, using only 3D skeleton and using RGB and Depth (RGB-D).

Some works used other different sources of data aiming at the same goal. For example, (Neverova2015, ) used depth and skeleton joints to achieve a Levenshtein Distance (LD) of , while (Georgios2014, ) used skeleton joints, RGB and audio, reaching .

In (Pigou2018, ), the authors applied RGB and depth information as inputs to a CNN followed by an LSTM classifier (lstm, ). With this architecture, they achieved a Jaccard Index (JI) of using RGB-D information, and using only RGB. One restriction with this approach is that the gesture should be performed within a time window of frames. Consequently, when the gesture is performed in more or less than frames, the model does not work correctly.

The works belonging to the second group, which aimed to classify the segmented gestures, usually used the labels and segmented gestures provided with each dataset to train and test their approaches. Unlike the first group, their primary objective is to classify/recognize the gestures, without tackling the problems of both spotting and recognizing the gestures in a video sequence.

Among this group, the work presented by (LiChuankun2017, ) achieved of accuracy using only the skeleton information. The authors represent the skeleton joints as a vector, and the sequence of skeletons as an image, where the coordinates of each joint represent a (,,) pixel in the image. Using different representations of skeleton information, (liu20193d, ; liu2019hidden, ) achieved respectively and . On the other hand, in (ChenXi2014, ), the authors used depth and skeleton information to classify the gestures, achieving only of accuracy. Other works such as (ChenXi2014, ; Yao2014, ; Wu2014, ; Fernando2015, ; Escobedo-Cardenas2015, ; Joshi2017, ) also used different types of information, but did not achieve a better result than (liu2019hidden, ). Also, in these works, the videos were cut into clips containing only the gestures; this clipping process was implemented using the labels related to each gesture present in the dataset. It is important to mention that skeleton joints are a data source that condenses dynamic and structural information related to the gesture. Consequently, they are well suited to dynamic gesture recognition. However, this type of data is usually provided by specific sensors, like the Microsoft Kinect, which was used to acquire the mentioned dataset. Table 1 shows a summary of the best results achieved on the Montalbano datasets for different types of data.

Since Kinect-like sensors are not easily found in ordinary places, which usually have regular surveillance cameras, approaches that use only RGB images are preferable to multimodal methods, which need more than one type of information to work.

| Work | Used Data Type | Result |
|---|---|---|
| (neverova2016moddrop, ) | Audio, RGB-D and Skeleton | 96.81% |
| (neverova2016moddrop, ) | RGB-D | 95.06% |
| (neverova2016moddrop, ) | Audio | 94.96% |
| (liu2019hidden, ) | Skeleton | 93.8% |
| (Efthimiou2016, ) | Audio and RGB | 93.00% |
| (Escobedo-Cardenas2015, ) | Skeleton and RGB-D | 88.38% |
| (Wu2016, ) | Depth | 82.62% |
| (Fernando2015, ) | Skeleton and Audio | 80.29% |
| (CongqiCao2015, ) | RGB | 60.07% |

Table 1: The main works that use the Montalbano datasets

Until now, works that perform gesture recognition using only RGB information reach at most 60.07% accuracy on the Montalbano datasets (Table 1). Two possibilities can explain these results: either the approaches proposed so far focused on recognizing gestures using multimodal information and did not make a significant effort to use just one information type, or the gestures are too difficult to distinguish from one another when only color information is used.

Considering only RGB images, the authors of (barros2014real, ) used a variant of the Motion History Image (MHI) technique (MHIrepresentation, ) to represent the movement information contained in a video sequence. This representation, denoted here as Star, was used jointly with a CNN to recognize dynamic gestures. The results reported using a dataset developed by authors was of of accuracy. Here it is important to mention that the gestures of this dataset differ considerably between classes and can therefore be easily recognized. Nevertheless, we consider that the Star representation is particularly interesting and can be applied to more challenging datasets. As such, in the next paragraphs, we give a detailed description of how the Star representation is calculated.

Consider a grayscale video with $N$ frames containing a dynamic gesture. The Star representation can be calculated in two main steps, as follows:

First step: the accumulated sum of the absolute difference between consecutive video frames is calculated. Equation (1) represents this process.

$$M(i,j) = \sum_{k=2}^{N} \left| I_{k}(i,j) - I_{k-1}(i,j) \right| \cdot \frac{k}{N} \qquad (1)$$

where: $N$, called memory size, is the number of frames of each video clip containing a gesture; $(i,j)$ are the coordinates of a pixel in a frame; M is the star image; $I_k$ is the $k$-th frame of a video containing a dynamic gesture; $|\cdot|$ is the modulus operator; and $k/N$ is called weighted shadow, and it is responsible for weighting the absolute difference depending on $k$.

Second step: Sobel masks are applied over the matrix M in the $x$ and $y$ directions, obtaining the matrices $M_x$ and $M_y$, respectively. Finally, the star representation is defined as three grayscale images, corresponding to the matrices M, $M_x$ and $M_y$.
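The two steps above can be sketched in NumPy as follows. This is a minimal illustration, not the original implementation: the function names `star_matrix` and `sobel_channels` are hypothetical, the weighted shadow is assumed to grow linearly as k/N, and the Sobel masks are applied with zero padding, which the text does not specify.

```python
import numpy as np


def star_matrix(frames):
    """Accumulate weighted absolute differences between consecutive frames.

    frames: array of shape (N, H, W) holding a grayscale clip.
    ASSUMPTION: the weighted shadow is the linear factor k / N.
    """
    frames = np.asarray(frames, dtype=np.float64)
    n = frames.shape[0]
    m = np.zeros(frames.shape[1:], dtype=np.float64)
    for idx in range(1, n):
        k = idx + 1  # 1-based frame number of the newer frame
        m += (k / n) * np.abs(frames[idx] - frames[idx - 1])
    return m


def sobel_channels(m):
    """Filter M with 3x3 Sobel masks in x and y (zero padding assumed)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    h, w = m.shape
    padded = np.pad(m, 1)  # one-pixel border of zeros
    mx = np.zeros((h, w), dtype=np.float64)
    my = np.zeros((h, w), dtype=np.float64)
    # Slide the masks as a cross-correlation; for Sobel masks this differs
    # from a true convolution only by the sign of the response.
    for di in range(3):
        for dj in range(3):
            window = padded[di:di + h, dj:dj + w]
            mx += kx[di, dj] * window
            my += ky[di, dj] * window
    return mx, my


# Toy clip: a single pixel lights up in the second frame and goes dark again.
clip = np.zeros((3, 4, 4))
clip[1, 1, 1] = 1.0
M = star_matrix(clip)
Mx, My = sobel_channels(M)
star = np.stack([M, Mx, My], axis=-1)  # three grayscale channels
```

Later differences receive larger weights, so the most recent motion dominates M while older motion fades, which is the "shadow" effect the weighting is meant to produce.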

According to the authors, the two extra channels $M_x$ and $M_y$ provide better discrimination of the type of movement present in M. However, when using a CNN, it is possible that the first convolutional layers have already learned this type of mask during the training stage, which makes the application of the Sobel masks over the matrix M unnecessary.

With this in mind, this work proposes some changes over the Star representation in order to generate an RGB image that encodes the gesture movement, contained in a video. The new representation is applied to a DL architecture based on an ensemble of two CNNs fused by a soft-attention mechanism. This proposal was evaluated over the Montalbano and GRIT (tsironi2017analysis, ) datasets, achieving a higher accuracy in the gesture recognition process in both datasets. We believe the proposed changes in the star representation, as well as the DL architecture used to solve this problem, can be considered the two main contributions of this work.

3 Proposal

In this section, the proposal for our dynamic gesture recognition technique is described. It consists of two main steps, specifically:

  • Pre-processing: each input video is represented as an RGB image by using a modified version of the aforementioned Star representation.

  • Classification: a dynamic gesture classifier is trained using an ensemble of CNNs. Specifically, the image obtained from the pre-processing step is passed as input to two pre-trained CNNs. The results of these two CNNs pass through a soft-attention mechanism and, after being weighted, are sent to a fully connected layer. Finally, a softmax classifier indicates the class to which the gesture most likely belongs.

Details of these two steps are given below.

3.1 Pre-processing

The original Star representation does not take into account the color information of each frame. Consequently, the resulting representation is a grayscale image, calculated using the Equation (1). Thus, our first goal is to represent temporal information present in a color video by using an improvement of the proposal in (barros2014real, ).

Therefore, to take advantage of the color information, the difference between two consecutive frames, calculated as Equation (1) in (barros2014real, ), can be replaced by the Euclidean Distance as presented in Equation (2).


where $\mathbf{I}_k(i,j)$ represents the RGB vector of a pixel at the position $(i,j)$ of the $k$-th frame; and $\|\cdot\|_2$ represents the $L_2$ norm.

However, the Euclidean distance considers only the vector norm and can just evaluate the intensity of each image. Thus, a better solution would be to include information from both magnitude and phase when calculating the distance between RGB vectors. That would allow us to evaluate not just the changes in image intensity, but the changes in hue and saturation as well. In this sense, (samatelo2012new, ) proposes a metric based on cosine similarity, Equations (3) and (4), that we decided to use to build a modified Star representation.


where is the angle between and .


Since , we can use (4) in (2) to substitute the original Star representation (Equation (1)) by Equation (5) and end up with an improved expression instead.


From Equation (5), the distance between two consecutive images can be calculated using the difference of intensities, scaled by a number that depends on the angle between each RGB pixel of the images. Therefore, both intensity and chromaticity information are taken into account. Figure 1 shows the results of both Equations (1) (star representation) and (5) (modified star representation) and their differences applied to a video.
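As a rough illustration of this idea, the sketch below computes a per-pixel distance between two RGB frames as the difference of their intensities (vector norms) scaled by a factor that depends on the angle between the two RGB vectors. The specific scaling factor used here, 1 - cos(θ)/2, is an assumption made for the example, as are the function name `cosine_weighted_distance` and the small constant `eps`; the text above only states that the factor depends on the angle.

```python
import numpy as np


def cosine_weighted_distance(prev, curr, eps=1e-8):
    """Per-pixel distance between two RGB frames of shape (H, W, 3).

    ASSUMPTION: the angle-dependent scale is 1 - cos(theta) / 2, chosen
    only for illustration. eps avoids division by zero on black pixels.
    """
    prev = np.asarray(prev, dtype=np.float64)
    curr = np.asarray(curr, dtype=np.float64)
    norm_prev = np.linalg.norm(prev, axis=-1)
    norm_curr = np.linalg.norm(curr, axis=-1)
    # Cosine of the angle between the two RGB vectors at each pixel.
    cos_theta = (prev * curr).sum(axis=-1) / (norm_prev * norm_curr + eps)
    scale = 1.0 - cos_theta / 2.0
    # Difference of intensities, scaled by the chromaticity term.
    return scale * np.abs(norm_prev - norm_curr)


# Same hue, different intensity: only the intensity term contributes.
a = np.array([[[1.0, 0.0, 0.0]]])
b = np.array([[[2.0, 0.0, 0.0]]])
d_same_hue = cosine_weighted_distance(a, b)

# Different hue and intensity: the scale grows as the angle grows.
c = np.array([[[0.0, 2.0, 0.0]]])
d_other_hue = cosine_weighted_distance(a, c)
```

With this assumed factor, two pixels with the same intensity change are considered further apart when their colors point in different directions in RGB space.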

Figure 1: Comparing the approaches. a) star representation proposed by (barros2014real, ) b) our modified star representation c) the difference between them.

Notice that with the new proposal, it is possible to extract more information from the movement. However, even with this new representation, the result is still a grayscale image and does not match state-of-the-art CNN architectures, which usually receive an RGB image as input. Besides that, this approach still does not solve the problem of temporal information loss. Then, movements with the same path but performed with different directions will have similar representations.

Thus, to improve the temporal representation and, simultaneously, create an RGB image as an output, we propose an approach summarized in Figure 2 and explained below:

  • Each color video containing a complete dynamic gesture must be divided equally into three sub-videos of frames, which likely will represent the pre-stroke, stroke and post-stroke steps of a dynamic gesture, as defined in (mcneill1992hand, ) and discussed by (liu2019hidden, ). If the number of frames is not divisible by three, the middle sub-video will contain frames;

  • For each resulting color sub-video, the matrix M of the Star representation is calculated using Equation (5). As a result, each video is represented by an RGB image, where the R-channel contains the M matrix calculated from the first sub-video, the G-channel has the M matrix from the central one, and the B-channel the M matrix from the last one.

Figure 2: Star RGB representation for the gesture basta.

Because the output of our approach is now an RGB image, we decided to call this image the Star RGB representation.
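The construction can be sketched as below. Several details are assumptions made for the example: the helper names (`split_in_three`, `l2_distance`, `star_rgb`) are hypothetical, any remainder of the division by three is given to the middle sub-video, the temporal weight is taken as k/n within each sub-video, and a plain Euclidean distance stands in for the cosine-based distance so that the block stays self-contained.

```python
import numpy as np


def split_in_three(n_frames):
    """Return (start, stop) bounds of three consecutive sub-videos.

    ASSUMPTION: when n_frames is not divisible by three, the extra frames
    go to the middle sub-video.
    """
    base, rest = divmod(n_frames, 3)
    sizes = [base, base + rest, base]
    bounds, start = [], 0
    for size in sizes:
        bounds.append((start, start + size))
        start += size
    return bounds


def l2_distance(prev, curr):
    """Per-pixel Euclidean distance between two (H, W, 3) frames."""
    return np.linalg.norm(curr - prev, axis=-1)


def star_rgb(video, distance=l2_distance):
    """Condense a color video (N, H, W, 3) into one (H, W, 3) image."""
    video = np.asarray(video, dtype=np.float64)
    channels = []
    for start, stop in split_in_three(video.shape[0]):
        clip = video[start:stop]
        n = clip.shape[0]
        m = np.zeros(video.shape[1:3], dtype=np.float64)
        for idx in range(1, n):
            # ASSUMPTION: linear temporal weight k / n inside each sub-video.
            m += ((idx + 1) / n) * distance(clip[idx - 1], clip[idx])
        channels.append(m)
    # First, central and last sub-videos fill the R, G and B channels.
    return np.stack(channels, axis=-1)


# Toy video with motion only near the beginning, and its time-reversed copy.
video = np.zeros((9, 2, 2, 3))
video[1] = 1.0
image = star_rgb(video)
image_reversed = star_rgb(video[::-1])
```

In this toy case the motion lands in the first channel for the original clip and in the last channel for the reversed one, so the two images differ even though the same pixels move.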

Notice that, besides resulting in a multichannel and sparse representation of a color video, the Star RGB representation has another advantage when gestures are distinguished only by the direction of the hand movement, like beckoning a direction to someone. This can be illustrated by simulating two gestures with the same movements but opposite directions. For instance, Figure 3 shows the Star and Star RGB representations calculated from two videos, where the second video is generated by reversing the frame sequence of the first one. If we compare Figures 3(a) and 3(b), we can notice that the original Star representation produces similar grayscale images for both videos, as is evident from Figure 3(c), which shows the difference between them. In contrast, the Star RGB produces two different color images, one for each video, as shown in Figures 3(d) and 3(e); Figure 3(f) shows the difference between each channel of images (d) and (e) in RGB format.

This result suggests that the Star RGB improves the original representation, encoding more temporal information than the previous approach presented in (barros2014real, ). Besides that, it will probably make the classifier model equivariant to similar movements that represent different gestures.

Figure 3: Comparing results between Star and Star RGB approaches. a) Star representation from the video in its original frame sequence; b) Star representation from the video in its inverted frame sequence; c) the difference between a) and b); d) Star RGB from the video in its original frame sequence; e) Star RGB from the video in its inverted frame sequence; f) the difference between d) and e). Image c) is almost zero. Most of the pixel values are close to zero (something about 1e-7), but were normalized so we could see the difference between images (a) and (b).

In summary, the Star RGB: captures the color information, better represents the temporal information of a video, is suitable for training models based on state-of-the-art CNNs for image classification, and, finally, is simple and fast to calculate.

3.2 Classification

After pre-processing a video sequence to build the corresponding Star RGB representation, the next step is classification. Our approach for the dynamic gesture classifier, shown in Figure 4, includes three parts: (i) a feature extractor, based on pre-trained CNNs; (ii) an ensemble of CNNs, where features are fused by a soft-attention mechanism; and (iii) a classifier, formed by fully connected layers and an output softmax layer. Each of these parts is explained as follows:

Figure 4: Proposed dynamic gesture classifier.

3.2.1 Feature extractor

This part is based on the CNN Resnet (resnet, ), which is specialized on image classification task, and was previously trained on the dataset ImageNet. Such dataset was released in the ILSVRC-2014 competition and contains more than million images distributed over distinct categories (russakovsky2015imagenet, ).

We chose the Resnet because it is one of the most used CNNs to classify images. Moreover, its architecture achieves state-of-the-art results in several image classification problems. These results are a consequence of the residual blocks, which can mitigate the vanishing gradient problem even in a very deep architecture (resnet, ).

Once the pre-processing step generates an RGB image that represents a dynamic gesture, it can be used as the input to a pre-trained Resnet. However, in order to achieve better results, we decided to use two Resnets in parallel (the Resnet and Resnet ). Each Resnet was cut at the convolution layer, that corresponds to a different number of residual blocks for each one of them.

3.2.2 Ensemble

The feature maps corresponding to the convolution layer of each Resnet are passed sequentially through an attention mechanism, called here as the soft-attention ensemble. This mechanism plays a role of weighting each feature map according to its importance to the predicted class.

The soft-attention ensemble was chosen considering that: (i) it should evaluate the information according to its importance for the task, unlike other fusion types such as summation, arithmetic mean or even max pooling; (ii) the sequential nature of the soft-attention ensemble avoids the concatenation of all feature maps of the inputs, which would increase the size of the input vector generated by the ensemble.

The architecture of the soft-attention ensemble is shared by all feature maps and is composed of a fully connected layer with 128 neurons using the ReLU activation function (relu2010rectified, ), and an output layer of one neuron without activation function. Thus, it receives a vector (a flattened CNN feature map) as input and outputs just one value.

The sequential operation of the soft-attention ensemble is shown in Figure 5. Let $K$ be the number of CNN feature maps that will be fused by the soft-attention. First, each flattened CNN feature map is passed through the soft-attention ensemble, generating as output a vector with $K$ elements, which is normalized using a softmax function. After that, the weighted sum of the flattened feature maps is calculated using the elements of the normalized vector as weighting coefficients. In this work, we use $K = 2$, given that the feature extractor is based on two CNNs. Note that, when using this mechanism, all feature maps must have the same dimensions.

Figure 5: Soft-attention ensemble.
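The fusion described above can be sketched in NumPy as follows. This is an illustrative forward pass only: the function name `soft_attention_fuse`, the feature size `D = 16` and the randomly initialised (untrained) weights are assumptions made for the example, and the Batch Normalization step is omitted.

```python
import numpy as np


def soft_attention_fuse(features, w1, b1, w2, b2):
    """Fuse K flattened feature maps with a shared scoring network.

    features: (K, D) array, one flattened feature map per row.
    w1, b1:   shared hidden layer (D -> 128) followed by ReLU.
    w2, b2:   shared output layer (128 -> 1) without activation.
    Returns the fused (D,) vector and the (K,) attention weights.
    """
    hidden = np.maximum(features @ w1 + b1, 0.0)   # (K, 128)
    scores = (hidden @ w2 + b2).ravel()            # one score per map
    scores = scores - scores.max()                 # numerically stable softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    fused = weights @ features                     # weighted sum of the maps
    return fused, weights


# Hypothetical sizes and random, untrained parameters for illustration.
rng = np.random.default_rng(0)
K, D, H = 2, 16, 128
features = rng.normal(size=(K, D))
w1 = rng.normal(scale=0.1, size=(D, H))
b1 = np.zeros(H)
w2 = rng.normal(scale=0.1, size=(H, 1))
b2 = np.zeros(1)

fused, weights = soft_attention_fuse(features, w1, b1, w2, b2)
```

Because the fused vector has the same size as a single feature map, adding more CNNs to the ensemble would not enlarge the input of the classifier, in contrast to concatenation.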

It is important to mention that Batch Normalization (ioffe2015batch, ) was applied to the feature maps before they were passed to the soft-attention. This step is necessary because the feature maps probably have different distributions. Therefore, using the soft-attention mechanism directly would not give an accurate result about the importance of each feature map for the problem.

3.2.3 Classifier

The output of the soft-attention ensemble feeds a classifier, which is composed of a hidden layer of 1024 neurons with Batch Normalization, dropout (srivastava2014dropout, ) and ReLU activation function; and an output layer of 20 neurons with softmax activation function. A softmax activation function over the last layer gives the probability that the input image belongs to one of the possible gestures of the dataset.

In addition, the cost function used is the mean of the cross-entropy, calculated for each minibatch and regularized by the $L_1$ norm of the entire set of weights, which includes those of the feature extractor, ensemble, and classifier. This type of regularization was applied to force the sparsity of the weights, which, in theory, can better handle input images that are also sparse (see Figure 2).
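A minimal sketch of such a cost function is shown below. It assumes an L1 penalty, since the stated goal is sparsity; the function name `regularized_cross_entropy` and the coefficient `lam` are hypothetical and are not taken from the experiments.

```python
import numpy as np


def regularized_cross_entropy(logits, labels, weights, lam=1e-4):
    """Mean cross-entropy over a minibatch plus an L1 weight penalty.

    logits:  (B, C) raw class scores.
    labels:  length-B sequence of integer class indices.
    weights: iterable with every weight array of the model.
    lam:     HYPOTHETICAL regularization coefficient.
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Log-softmax computed in a numerically stable way.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    cross_entropy = -log_probs[np.arange(len(labels)), labels].mean()
    # ASSUMPTION: L1 norm, the usual choice for encouraging sparse weights.
    l1_penalty = sum(np.abs(w).sum() for w in weights)
    return cross_entropy + lam * l1_penalty


# Uniform predictions over 20 gesture classes and a tiny set of weights.
logits = np.zeros((2, 20))
labels = [0, 5]
loss_plain = regularized_cross_entropy(logits, labels, weights=[])
loss_reg = regularized_cross_entropy(
    logits, labels, weights=[np.array([1.0, -2.0])], lam=0.1
)
```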

4 Experiments and results

This section addresses the datasets used throughout the experiments, the implementation and training of the proposed architecture and, finally, the results obtained in the evaluation step.

In short, we performed two different experiments. The first experiment was carried out to evaluate the proposed architecture and the impact of soft-attention ensemble on the gesture classifier performance. The second experiment aimed to evaluate the use of the Star RGB representation within the proposed architecture. These experiments were conducted with different datasets, detailed as follows.

4.1 Montalbano gesture dataset

This dataset was cast in the challenge Chalearn: looking at people - 2014 (escalera2014, ), and comprises of cultural/anthropological Italian gestures ( for training, for validation and for testing), distributed in different types. All the gestures were captured using a Kinect sensor and have multimodal information: RGB, depth, skeleton and user mask. In the challenge, the candidates could use any provided data to recognize a sequence of gestures (between and per video) performed in a continuous video. Figure 6 shows a representation of the gestures contained in this dataset.

We decided to use this dataset because it is one of the largest sets currently released that focuses on the problem of dynamic gesture recognition. Also, as mentioned by (tsironi2017analysis, ), the available datasets of dynamic gestures based on RGB are very small and are captured focusing only on the information of the hands and not on the entire body.

Although the original challenge in (escalera2014, ) was to use multimodal data to recognize dynamic gestures, we believe that only the RGB information can be successfully used for this purpose.

Our main motivation comes from the fact that there is a large number of surveillance cameras in many public places, where gestures could be used to promote the interaction between humans, devices and also the environment. Therefore, being able to recognize gestures only from color information is an attractive possibility for an HMI that may be deployed in spaces having just standard cameras installed.

Figure 6: Representation of the 20 gestures present in the Montalbano gesture dataset (escalera2013multi, ).

As participation in the competition was not the goal of this work, we segmented the gestures into several videos, according to the provided labels. Thus, each video of the training or test set contains only one dynamic gesture. Now, the problem is no longer to recognize a sequence of gestures in a video but to classify videos into different classes of dynamic gestures.
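This segmentation step amounts to slicing each continuous video with its annotations. The sketch below assumes a hypothetical annotation format, a list of `(label, start, stop)` tuples with 0-based frame indices and an exclusive `stop`; the actual label files of the dataset may use a different layout.

```python
def segment_gestures(frames, annotations):
    """Cut one continuous video into single-gesture clips.

    frames:      sequence of frames of the continuous video.
    annotations: iterable of (label, start, stop) tuples.
                 ASSUMPTION: 0-based indices, stop is exclusive.
    Returns a list of (label, clip) pairs, one per annotated gesture.
    """
    clips = []
    for label, start, stop in annotations:
        if not 0 <= start < stop <= len(frames):
            raise ValueError(f"invalid range for {label!r}: {start}..{stop}")
        clips.append((label, frames[start:stop]))
    return clips


# Frames are stand-in integers here; real frames would be image arrays.
video = list(range(10))
clips = segment_gestures(video, [("ok", 0, 3), ("basta", 5, 9)])
```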

After all the videos were segmented, the proposed technique was applied, as presented in Section 3.2. Figure 7 illustrates a sample of each gesture shown in Figure 6 after calculating the Star RGB representation.

Figure 7: A Star RGB representation of one sample of each gesture present in the Montalbano dataset: (a) vattene, (b) vieniqui, (c) perfetto, (d) furbo, (e) cheduepalle, (f) chevuoi, (g) daccordo, (h) seipazzo, (i) combinato, (j) freganiente, (k) ok, (l) cosatifarei, (m) basta, (n) prendere, (o) noncenepiu, (p) fame, (q) tantotempo, (r) buonissimo, (s) messidaccordo, (t) sonostufo.

4.2 GRIT gesture Dataset

In order to evaluate our proposal for recognizing gestures used in human-robot interaction, we decided to use GRIT (Gesture Commands for Robot inTeraction), a dataset comprised of gestures, distributed in nine distinct classes. Unlike the Montalbano dataset, this one was released with the gestures already segmented. Furthermore, as it was created to be used for interaction with robots, its gestures are quite separable. Figure 8 illustrates a representation of the gestures contained in the dataset. Additionally, Figure 9 illustrates the Star RGB representation of a sample of each one of its nine gestures.

Figure 8: Representation of the nine gestures in the GRIT dataset.
Figure 9: The Star RGB representation of one sample of each gesture of the GRIT dataset (tsironi2017analysis, ): (a) Abort, (b) Circle, (c) Hello, (d) No!, (e) Stop, (f) Warn, (g) Turn left, (h) Turn, (i) Turn Right.

4.3 Implementation and Training

The proposed architecture was implemented using PyTorch, an open-source machine learning library developed by Facebook’s artificial-intelligence research group (pytorch, ), in its version .

The computer used in the experiments had the following configuration: (i) Operating system Linux Ubuntu Server, distribution 16.04; (ii) Intel Core i7-7700 processor, 3.60GHz with 4 physical cores; (iii) GB of RAM; (iv) 1TB of storage unit (hard drive); (v) Nvidia Titan V graphics card, with of dedicated memory.

The following hyper-parameters were used in the training step: batch size of and for Montalbano and GRIT datasets, respectively; maximum number of epochs was ; learning rate of for the CNNs, and for the fully connected classifier and the soft-attention mechanism, both decreasing at each epoch; dropout keep probability of ; and the Adam optimizer algorithm. Also, some data augmentation techniques were used. Specifically, for the Montalbano dataset, a random crop of size pixels, a random horizontal flip, a random rotation with , and random Gaussian noise with and were applied. For the GRIT dataset, on the other hand, only a random crop was applied during training.

Finally, in the training step, an early stop strategy was adopted. So, the training was executed until an average accuracy over the last epochs greater than was achieved.
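A minimal sketch of such a stopping rule is given below. The window length, the threshold, the epoch limit and the accuracy values are placeholders chosen only for illustration, not the values used in our experiments, and the names `MovingAverageEarlyStop` and `epochs_until_stop` are hypothetical. The sketch also assumes that the rule only fires once the window is full.

```python
from collections import deque


class MovingAverageEarlyStop:
    """Signal a stop when the mean accuracy of the last `window` epochs
    exceeds `threshold` (both are illustrative parameters)."""

    def __init__(self, window: int, threshold: float):
        self.threshold = threshold
        self.history = deque(maxlen=window)  # keeps only the newest values

    def update(self, accuracy: float) -> bool:
        self.history.append(accuracy)
        window_full = len(self.history) == self.history.maxlen
        mean = sum(self.history) / len(self.history)
        return window_full and mean > self.threshold


def epochs_until_stop(accuracies, window, threshold, max_epochs):
    """Return the number of epochs actually run for a given accuracy trace."""
    stopper = MovingAverageEarlyStop(window, threshold)
    trace = accuracies[:max_epochs]
    for epoch, accuracy in enumerate(trace, start=1):
        if stopper.update(accuracy):
            return epoch
    return len(trace)


# Placeholder accuracy trace, one value per epoch.
trace = [0.50, 0.70, 0.90, 0.95, 0.97, 0.99]
stopped_at = epochs_until_stop(trace, window=3, threshold=0.90, max_epochs=100)
```

In this toy trace the mean of the last three epochs first exceeds the threshold at the fifth epoch, so training would stop there instead of running up to the epoch limit.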

Due to implementation compatibility, it was necessary to resize each frame. Hence, in the training step, each frame was resized to pixels before data augmentation, and in the test step, each frame was cropped at the center point, resulting in an image of pixels.
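The cropping and flipping operations mentioned above can be illustrated with plain Python lists standing in for images. This is only a sketch under stated assumptions: the image and crop sizes are placeholders, the helper names are hypothetical, and in practice a library such as torchvision would normally be used instead.

```python
import random


def _check(image, size):
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("crop size exceeds image size")
    return height, width


def center_crop(image, size):
    """Crop a size x size window around the image center (test time)."""
    height, width = _check(image, size)
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def random_crop(image, size, rng=random):
    """Crop a size x size window at a random position (training time)."""
    height, width = _check(image, size)
    top = rng.randint(0, height - size)   # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


# Toy 4 x 4 "image" whose pixel values are 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
center = center_crop(image, 2)
patch = random_crop(image, 2, rng=random.Random(0))
flipped = horizontal_flip([[1, 2], [3, 4]])
```

The random crop gives the network slightly different views of the same Star RGB image at every epoch, while the deterministic center crop keeps the test evaluation repeatable.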

4.4 Performance measurement

For multiclass classification problems, it is common to analyze the result using the accuracy metric. In the context of gesture recognition, the accuracy rate is calculated as the number of gestures classified correctly divided by the total number of gestures on the test dataset.

Therefore, for the Montalbano dataset, the accuracy achieved for each class will be presented in tables. In addition, to better visualize the behavior of the classifier, the results will also be described using a confusion matrix, which has one row for each ground-truth class and one column for each predicted class. Thus, the value of each cell of the matrix (row i, column j) indicates the number of gestures belonging to class i that were classified as class j.
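The two measurements described above can be sketched as follows. The example assumes that the per-class accuracy is the fraction of the test samples of a class that were classified correctly (the diagonal cell divided by the row sum); the labels and predictions are toy data, not results from our experiments.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    index = {name: k for k, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, predicted in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[predicted]] += 1
    return matrix


def overall_accuracy(matrix):
    """Correctly classified gestures divided by all test gestures."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_accuracy(matrix):
    """Diagonal cell divided by the row sum, one value per true class."""
    return [row[k] / sum(row) if sum(row) else 0.0
            for k, row in enumerate(matrix)]


# Toy example with three of the Montalbano class names.
classes = ["ok", "basta", "noncenepiu"]
truth = ["ok", "ok", "basta", "noncenepiu", "noncenepiu", "noncenepiu"]
preds = ["ok", "noncenepiu", "basta", "ok", "noncenepiu", "noncenepiu"]

matrix = confusion_matrix(truth, preds, classes)
accuracy = overall_accuracy(matrix)
class_accuracy = per_class_accuracy(matrix)
```

Reading a row of the matrix shows which classes a given gesture is confused with, which is how the analysis of the worst-performing gesture is carried out later in the text.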

To better compare results on the GRIT dataset, we will run five hold-out experiments, following the original procedure described in (tsironi2017analysis, ). At each one of the five rounds, the dataset will be shuffled and divided into two subsets: for training and for test. Thus, rather than a confusion matrix and accuracy alone, a table will be used with the metrics accuracy, recall, precision and F1-Score for all classes, calculated as the average of each metric.
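The hold-out protocol and its metrics can be sketched as below. Several points are assumptions made for illustration: the training fraction passed to `holdout_split` is a placeholder, precision, recall and F1-Score are macro-averaged over the classes within each round, and the per-round values are then averaged over the rounds. All function names are hypothetical.

```python
import random


def holdout_split(samples, train_fraction, seed):
    """Shuffle the samples and split them into training and test subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def macro_metrics(true_labels, predicted_labels, classes):
    """Accuracy plus class-averaged precision, recall and F1-Score."""
    pairs = list(zip(true_labels, predicted_labels))
    precisions, recalls, f1_scores = [], [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
        precisions.append(precision)
        recalls.append(recall)
    n = len(classes)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
    }


def average_over_rounds(round_metrics):
    """Average each metric over the hold-out rounds."""
    count = len(round_metrics)
    return {key: sum(m[key] for m in round_metrics) / count
            for key in round_metrics[0]}


# Toy check on a two-class problem.
metrics = macro_metrics(["a", "a", "b", "b"], ["a", "b", "b", "b"], ["a", "b"])
train, test = holdout_split(range(10), train_fraction=0.8, seed=0)
```

Using a different seed for each of the five rounds reshuffles the data, so the averaged metrics are less sensitive to one particular split.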

5 Results and discussion

In this section, we present the results obtained for both datasets: Montalbano and GRIT.

5.1 Results for Montalbano Dataset

After training with the setup described previously, the model was run over the test dataset, obtaining an accuracy of 94.58%.

Gesture Acc(%) Gesture Acc(%)
fame 99.46 cosatifarei 94.68
cheduepalle 99.42 prendere 94.02
combinato 98.91 buonissimo 93.82
daccordo 98.77 seipazzo 92.97
tantotempo 98.27 chevuoi 92.42
basta 97.24 vieniqui 92.31
sonostufo 97.14 freganiente 91.76
messidaccordo 96.11 vattene 88.76
furbo 96.07 ok 87.36
perfetto 95.51 noncenepiu 86.63
Table 2: Result of the experiments on the Montalbano dataset.

Table 2 lists the accuracy obtained for each gesture and Figure 10 shows the confusion matrix of the predictions. One can notice that the majority of the gestures achieved more than 90% of accuracy, while the gestures fame, cheduepalle, combinato, daccordo and tantotempo reached more than 98%. In just a few cases, namely the gestures vattene, ok and noncenepiu, the accuracy was less than 90%. However, the worst case (noncenepiu) still managed to achieve 86.63%.

Figure 10: Confusion matrix of the predictions made by the model trained with the Montalbano gesture dataset.
Gesture Resnet 50 (%) Resnet 101 (%) Ensemble (%)
fame 96.76 97.84 99.46
cheduepalle 87.86 99.42 99.42
combinato 95.11 98.91 98.91
daccordo 98.16 98.77 98.77
tantotempo 96.53 98.27 98.27
basta 98.9 97.24 97.24
sonostufo 95.43 96.57 97.14
messidaccordo 92.22 96.11 96.11
furbo 94.94 91.57 96.07
perfetto 92.13 91.57 95.51
cosatifarei 92.02 81.91 94.68
prendere 91.85 92.93 94.02
buonissimo 91.57 91.01 93.82
seipazzo 92.97 90.81 92.97
chevuoi 91.41 91.92 92.42
vieniqui 87.91 84.07 92.31
freganiente 91.76 78.82 91.76
vattene 88.76 72.47 88.76
ok 83.33 82.76 87.36
noncenepiu 79.07 81.4 86.63
Table 3: Result of the experiments on the Montalbano dataset using each Resnet individually and combining them through the soft-attention ensemble.

The original architecture was modified to demonstrate the impact of the soft-attention ensemble on the performance of the gesture classifier. To do that, we removed the soft-attention fusion and sent the output of the convolutional layers of each Resnet directly to the fully connected layers. Hence, the gesture classifier using only Resnet achieved % of accuracy, while Resnet achieved %. Table 3 shows the results for both experiments.

Notice that, for the gestures fame, cheduepalle, combinato, tantotempo, and messidaccordo, Resnet 50 performed worse than Resnet 101. However, in other cases, for instance vattene, freganiente and cosatifarei, Resnet 50 was much better than Resnet 101. Nevertheless, for all gestures, including the ensemble step yielded the best accuracy, except for the gesture basta, for which Resnet 50 achieved 98.9%.
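To make the role of the fusion step more concrete, the sketch below shows one generic form of soft attention over two feature vectors: each branch receives a scalar score, the scores are normalized with a softmax, and the fused vector is the weighted sum of the branches. This is an assumption-laden illustration and not necessarily the mechanism used in our architecture; the scoring vector, the feature values and the function names are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention_fusion(features, scoring_weights):
    """Fuse several feature vectors with softmax attention weights.

    `features` holds one vector per branch (e.g. one per CNN) and
    `scoring_weights` is a hypothetical learned vector that maps each
    feature vector to a scalar relevance score through a dot product.
    """
    scores = [sum(f * w for f, w in zip(vector, scoring_weights))
              for vector in features]
    attention = softmax(scores)
    size = len(features[0])
    fused = [sum(a * vector[d] for a, vector in zip(attention, features))
             for d in range(size)]
    return fused, attention


# Two toy branch outputs; with a zero scoring vector both get equal weight.
branch_a = [1.0, 0.0]
branch_b = [0.0, 1.0]
fused_equal, weights_equal = soft_attention_fusion([branch_a, branch_b], [0.0, 0.0])
fused_biased, weights_biased = soft_attention_fusion([branch_a, branch_b], [10.0, 0.0])
```

Because the weights depend on the input, such a fusion can lean on whichever branch is more informative for a given sample, which is consistent with the ensemble matching or exceeding the better of the two networks for most gestures in Table 3.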

Thus, considering the number of classes of the problem and the use of only RGB information to represent and recognize dynamic gestures, we consider the results obtained here competitive with other approaches that used multimodal data. Furthermore, this was one of the best results for classifying gestures on that dataset, beating even other works that employ more than one source of information, for instance, (Joshi2017, ), (CongqiCao2015, ) and (Escobedo-Cardenas2015, ). In this way, we believe that the results obtained here can be considered part of the state-of-the-art for dynamic gesture recognition based only on RGB images, for this dataset.

5.2 Results for GRIT Dataset

The results of the five hold-out experiments can be seen in Table 4. As expected, this proposal achieves better results on the GRIT dataset, even outperforming the authors’ results in (tsironi2017analysis, ). The improvement was more than 6% in all metrics: accuracy, precision, recall, and F1-Score. Furthermore, our best result was of accuracy against achieved by the authors. This result shows how the Star RGB improves dynamic gesture representation and can contribute to the dynamic gesture recognition research field.

Metric Tsironi et al. (tsironi2017analysis, ) Ours
Table 4: Comparing the results of our proposal against the results from (tsironi2017analysis, ) using the GRIT dataset.

5.3 General Comments

Although the results are promising, some issues should be considered. This technique must be used in an environment where the cameras are static, or where the relative movement between the background and the person is minimized. Therefore, when there is a moving camera, a possible solution could be to estimate the homography between two sequential images from detected background features and calculate the camera motion, as proposed by (iDT, ) for action recognition. Knowing how the camera moves may make it possible to better isolate the real movement of the person related to the gesture and then perform its recognition.

Also, as observed in (tsironi2017analysis, ), techniques like ours tend to lose hand details. Thus, gestures that depend on the shape of the hands and share the same movement as other gestures could generate false positives. To illustrate the effect of this constraint, an information visualization technique known as saliency map (Grad-CAM++ (cam_pp, )) was applied to some gestures of the class noncenepiu, for which the classifier achieved the worst results.

In general, the saliency maps of the gesture noncenepiu (see Figure 11) are similar to most of the maps of the gestures it has been confused with (see the row of the confusion matrix in Figure 10). These gestures are: ok, freganiente, basta and prendere.

From the videos, it is possible to see that the gesture noncenepiu is done twisting the arm over the elbow, with the thumb and pointer finger forming an “L” (the hand shape can be seen in Figures 12a and 12e). Therefore, when isolating this movement, for some individuals, it is very similar to the movement from the gestures ok, prendere, freganiente and other ones. In this case, gesture recognition could be improved if the approach was capable of recognizing the hand details, instead of just the movements.

Figure 11 also shows how each CNN extracts different feature maps related to the input frame, while the soft-attention ensemble captures the essential information from each one. Another interesting observation concerns the gesture basta, which was confused with the gesture noncenepiu when the user performs it with both arms (see Figures 11-i, j, k, l). As discussed before, the movements are very similar even though the hand shapes are different from each other (see Figures 12 e, f).

Figure 11: Saliency maps of some gestures of the class noncenepiu misclassified as another class. Each group of four panels shows the maps for Resnet 50, Resnet 101, the ensemble for the confused class, and the ensemble for noncenepiu: (a)-(d) ok, (e)-(h) freganiente, (i)-(l) basta, (m)-(p) prendere.
Figure 12: A sample of the hand shape of some gestures: (a) noncenepiu - one hand, (b) ok, (c) freganiente, (d) freganiente, (e) noncenepiu - two hands, (f) basta.

6 Conclusions and Future Work

Considering the importance of dynamic gesture recognition for HMI and also the problem of recognizing gestures using just color information, this work reports two contributions: (i) an approach called Star RGB representation, which describes a video clip containing a dynamic gesture in a single RGB image; (ii) a dynamic gesture classifier based on two pre-trained Resnets and a soft-attention ensemble, followed by a set of fully connected layers. The experiments were carried out on both the Montalbano and GRIT datasets, achieving an accuracy of 94.58% for the Montalbano dataset and a mean accuracy of more than 98% over five random hold-out experiments for the GRIT dataset. The obtained results show that the Star RGB, used with the soft-attention ensemble, outperforms previous works, such as (tsironi2017analysis, ), (LiChuankun2017, ), (Joshi2017, ), (CongqiCao2015, ) and (Wang2017, ), and achieves the state-of-the-art.

We suggest the following tasks as future work to improve the proposed approach. Firstly, to develop a new classifier architecture that takes into account the hand information of each gesture. Secondly, to apply this proposal in a robotic environment. For this, it will be necessary to develop a gesture spotting algorithm that can operate in real time. Then, using the proposed architecture, gestures of a new desired vocabulary could be recognized, allowing interaction with the robot.


The authors would like to acknowledge the support from CAPES (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior) through the scholarship given to the first author, as well as acknowledge the support from NVIDIA Corporation through the donation of the Titan V GPU used in this research.



  • (1) D. McNeill, Hand and mind: What gestures reveal about thought, University of Chicago Press, 1992.
  • (2) S. Mitra, T. Acharya, Gesture recognition: A survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 37 (3) (2007) 311–324. doi:10.1109/TSMCC.2007.893280.
  • (3) S. S. Rautaray, A. Agrawal, Vision based hand gesture recognition for human computer interaction: a survey, Artificial Intelligence Review 43 (1) (2015) 1–54. doi:10.1007/s10462-012-9356-9.
  • (4) S. Saikia, S. Saharia, A survey on vision-based dynamic gesture recognition, International Journal of Computer Applications 138 (1) (2016). doi:10.5120/ijca2016908655.
  • (5) N. Neverova, C. Wolf, G. Taylor, F. Nebout, Moddrop: adaptive multi-modal gesture recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (8) (2016) 1692–1706. doi:10.1109/TPAMI.2015.2461544.
  • (6) S. Escalera, V. Athitsos, I. Guyon, Challenges in multi-modal gesture recognition, in: Gesture Recognition, Springer, 2017, pp. 1–60. doi:10.1007/978-3-319-57021-1_1.
  • (7) P. Barros, G. I. Parisi, D. Jirak, S. Wermter, Real-time gesture recognition using a humanoid robot with a deep neural architecture, in: Humanoid Robots (Humanoids), 2014 14th IEEE-RAS International Conference on, IEEE, 2014, pp. 646–651. doi:10.1109/HUMANOIDS.2014.7041431.
  • (8) W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F. E. Alsaadi, A survey of deep neural network architectures and their applications, Neurocomputing 234 (2017) 11–26. doi:10.1016/j.neucom.2016.12.038.
  • (9) Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M. S. Lew, Deep learning for visual understanding: A review, Neurocomputing 187 (2016) 27–48. doi:10.1016/j.neucom.2015.09.116.
  • (10) S. Ji, W. Xu, M. Yang, K. Yu, 3d convolutional neural networks for human action recognition, IEEE transactions on pattern analysis and machine intelligence 35 (1) (2013) 221–231. doi:10.1109/TPAMI.2012.59.
  • (11) K. Hara, H. Kataoka, Y. Satoh, Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?, in: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 6546–6555.
  • (12) K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, in: Advances in neural information processing systems, 2014, pp. 568–576.
  • (13) C. Feichtenhofer, A. Pinz, A. Zisserman, Convolutional two-stream network fusion for video action recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1933–1941. doi:10.1109/CVPR.2016.213.
  • (14) J. Carreira, A. Zisserman, Quo vadis, action recognition? a new model and the kinetics dataset, in: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308. doi:10.1109/CVPR.2017.502.
  • (15) S. Escalera, J. Gonzàlez, X. Baró, M. Reyes, O. Lopes, I. Guyon, V. Athitsos, H. Escalante, Multi-modal gesture recognition challenge 2013: Dataset and results, in: Proceedings of the 15th ACM on International conference on multimodal interaction, ACM, 2013, pp. 445–452. doi:10.1145/2522848.2532595.
  • (16) S. Escalera, X. Baró, J. Gonzalez, M. A. Bautista, M. Madadi, M. Reyes, V. Ponce-López, H. J. Escalante, J. Shotton, I. Guyon, Chalearn looking at people challenge 2014: Dataset and results, in: Workshop at the European Conference on Computer Vision, Springer, 2014, pp. 459–473. doi:10.1007/978-3-319-16178-5_32.
  • (17) E. Efthimiou, S. E. Fotinea, T. Goulas, M. Koutsombogera, P. Karioris, A. Vacalopoulou, I. Rodomagoulakis, P. Maragos, C. Tzafestas, V. Pitsikalis, Y. Koumpouros, A. Karavasili, P. Siavelis, F. Koureta, D. Alexopoulou, The MOBOT rollator human-robot interaction model and user evaluation process, 2016 IEEE Symposium Series on Computational Intelligence (SSCI) (2016) 1–8. doi:10.1109/SSCI.2016.7850061.
  • (18) C. Li, Y. Hou, P. Wang, W. Li, Joint distance maps based action recognition with convolutional neural networks, IEEE Signal Processing Letters 24 (5) (2017) 624–628. doi:10.1109/LSP.2017.2678539.
  • (19) A. Joshi, C. Monnier, M. Betke, S. Sclaroff, Comparing random forest approaches to segmenting and classifying gestures, Image and Vision Computing 58 (2017) 86–95.
  • (20) H. Wang, L. Wang, Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 499–508.
  • (21) N. Neverova, C. Wolf, G. W. Taylor, F. Nebout, Multi-scale deep learning for gesture detection and localization, in: Workshop at the European conference on computer vision, Springer, 2014, pp. 474–490. doi:10.1007/978-3-319-16178-5_33.
  • (22) G. Pavlakos, S. Theodorakis, V. Pitsikalis, A. Katsamanis, P. Maragos, Kinect-based multimodal gesture recognition using a two-pass fusion scheme, in: 2014 IEEE International Conference on Image Processing (ICIP), IEEE, 2014, pp. 1495–1499. doi:10.1109/ICIP.2014.7025299.
  • (23) L. Pigou, A. Van Den Oord, S. Dieleman, M. Van Herreweghe, J. Dambre, Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video, International Journal of Computer Vision 126 (2-4) (2018) 430–439. doi:10.1007/s11263-016-0957-7.
  • (24) S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8) (1997) 1735–1780.
  • (25) X. Liu, G. Zhao, 3d skeletal gesture recognition via sparse coding of time-warping invariant riemannian trajectories, in: International Conference on Multimedia Modeling, Springer, 2019, pp. 678–690. doi:10.29007/xhfp.
  • (26) X. Liu, H. Shi, X. Hong, H. Chen, D. Tao, G. Zhao, Hidden states exploration for 3d skeleton-based gesture recognition, in: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2019, pp. 1846–1855. doi:10.1109/WACV.2019.00201.
  • (27) X. Chen, M. Koskela, Using appearance-based hand features for dynamic RGB-D gesture recognition, Proceedings - International Conference on Pattern Recognition (2014) 411–416. doi:10.1109/ICPR.2014.79.
  • (28) A. Yao, L. V. Gool, P. Kohli, Gesture recognition portfolios for personalization, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2014) 1923–1930. doi:10.1109/CVPR.2014.247.
  • (29) D. Wu, L. Shao, Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2014) 724–731. doi:10.1109/CVPR.2014.98.
  • (30) B. Fernando, E. Gavves, M. José Oramas, A. Ghodrati, T. Tuytelaars, Modeling video evolution for action recognition, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 07-12-June (2015) 5378–5387. doi:10.1109/CVPR.2015.7299176.
  • (31) E. Escobedo-Cardenas, G. Camara-Chavez, A robust gesture recognition using hand local data and skeleton trajectory, in: 2015 IEEE International Conference on Image Processing (ICIP), IEEE, 2015, pp. 1240–1244. doi:10.1109/ICIP.2015.7350998.
  • (32) D. Wu, L. Pigou, P. J. Kindermans, N. D. H. Le, L. Shao, J. Dambre, J. M. Odobez, Deep Dynamic Neural Networks for Multimodal Gesture Segmentation and Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (8) (2016) 1583–1597. doi:10.1109/TPAMI.2016.2537340.
  • (33) C. Cao, Y. Zhang, H. Lu, Multi-modal learning for gesture recognition, in: 2015 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2015, pp. 1–6. doi:10.1109/ICME.2015.7177460.
  • (34) A. F. Bobick, J. W. Davis, The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis & Machine Intelligence (3) (2001) 257–267. doi:10.1109/34.910878.
  • (35) E. Tsironi, P. Barros, C. Weber, S. Wermter, An analysis of convolutional long short-term memory recurrent neural networks for gesture recognition, Neurocomputing 268 (2017) 76–86. doi:10.1016/j.neucom.2016.12.088.
  • (36) J. L. A. Samatelo, E. O. T. Salles, A new change detection algorithm for visual surveillance system, IEEE Latin America Transactions 10 (1) (2012) 1221–1226. doi:10.1109/TLA.2012.6142465.
  • (37) K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.
  • (38) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252. doi:10.1007/s11263-015-0816-y.
  • (39) V. Nair, G. E. Hinton, Rectified linear units improve restricted boltzmann machines, in: Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
  • (40) S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167 (2015).
  • (41) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (1) (2014) 1929–1958.
  • (42) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in pytorch (2017).
  • (43) H. Wang, C. Schmid, Action recognition with improved trajectories, in: Proceedings of the IEEE international conference on computer vision, 2013, pp. 3551–3558. doi:10.1109/ICCV.2013.441.
  • (44) A. Chattopadhay, A. Sarkar, P. Howlader, V. N. Balasubramanian, Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks, in: Applications of Computer Vision (WACV), 2018 IEEE Winter Conference on, IEEE, 2018, pp. 839–847. doi:10.1109/WACV.2018.00097.