Progression Modelling for Online and Early Gesture Detection

09/14/2019 ∙ by Vikram Gupta, et al. ∙ Daimler AG ∙ IIT Bombay

Online and early detection of gestures is crucial for building touchless gesture based interfaces. These interfaces should operate on a stream of video frames instead of the complete video and detect the presence of gestures at an earlier stage than post-completion to provide a real time user experience. To achieve this, it is important to recognize the progression of the gesture across different stages so that appropriate responses can be triggered on reaching the desired execution stage. To address this, we propose a simple yet effective multi-task learning framework which models the progression of the gesture along with frame level recognition. The proposed framework recognizes the gestures at an early stage with high precision and also achieves state-of-the-art recognition accuracy of 87.8%, which is closer to the human accuracy of 88.4% on the NVIDIA gesture dataset in the offline configuration, and advances the state-of-the-art by more than 4%. We also introduce tightly segmented annotations for the NVIDIA gesture dataset and set up a strong baseline for gesture localization for this dataset. We also evaluate our framework on the Montalbano dataset and report competitive results.




1 Introduction

Gestures are, arguably, the oldest form of human communication. Unlike spoken language, they are easier to learn for people belonging to different demographics. Gestures are crucial for people with hearing or speech impairments and are also effective in noisy conditions. These properties make gestures suitable for designing universal and robust interfaces for touchless human computer interaction (HCI). These interfaces can be used to design intelligent car interiors to enable convenient user interaction with multimedia systems, reading lights, sunroof etc. without distracting the driver. Such interfaces are also suitable for designing immersive gaming and augmented reality experiences.

While gestures are intuitive and easy to learn, gesture recognition is a challenging task as people perform gestures in different ways and at different velocities. Variations in the ambient conditions like lighting, background and occlusion further increase the complexity. Gesture recognition systems designed for interactive applications should also address two requirements: online operation and early prediction. These systems should work in an online setting, where the gesture recognition is done on an incoming stream of video frames instead of the complete video. They should also respond in real time, as a response time of more than 100 ms degrades the user experience [14][3]. To address this, it is important to recognize and predict the gesture earlier than its completion so that an appropriate response can be triggered in real time. Different stages of the gesture can be used to trigger early prediction, but it is difficult to define the stage optimally at training time as it is guided both by the domain requirements and by the early prediction vs precision trade-off characteristics of the model. We propose that modelling the complete temporal progression of the gesture along with recognizing the gesture at frame level addresses the above problems and helps to design interactive gesture recognition systems. Our method can be used across different types of gestures as it does not depend on explicit sub-gesture level annotations or information about the structure of the gestures.

Figure 1: Overall schema of our framework. We use a branched architecture with a 3DCNN and a GRU as the spatiotemporal feature extractor and estimate the gesture progression levels and gesture category with the GPM and Classification modules respectively.

A lot of research has been conducted in the field of gesture recognition. However, most of these approaches do not address online operation and early prediction. Gesture recognition approaches proposed by Narayana et al. [16] and Miao et al. [13] operate in an offline setting where the recognition is done after the gesture has finished. Localization based approaches as explored by Pigou et al. [19] perform online frame level gesture classification but cannot be used for early prediction as the gesture progression is not modelled. This makes it difficult to decide when to trigger the response. Simple heuristics like using the number of frames as the threshold for triggering prediction do not work well as gestures are of different durations. The "Clockwise" gesture in the NVIDIA gesture dataset [15] has a mean duration of 0.8 seconds while "one finger tap" has 0.4 seconds. Even the same gesture can be performed at varying speeds. The duration of the "swipe right" gesture in the dataset ranges from 0.35 to 1 second.

Molchanov et al. [15] explored early gesture detection using connectionist temporal classification (CTC) [8]. The CTC loss function enables gesture detection without requiring frame level annotations, which makes it useful as annotation is time consuming and expensive. However, the system learns to detect only a segment of the gesture instead of the complete gesture. Moreover, the location or duration of this segment cannot be changed, which makes it difficult to adapt the gesture predictions to meet the domain requirements.

In this work, we propose a simple yet effective multitask learning framework to address online operation and early prediction. Our framework consists of two sub-modules which operate simultaneously on every frame: a classification module and a gesture progression modelling (GPM) module. The classification module learns to recognize the category of the gesture and the GPM module models the progression level of the gesture. GPM allows the system to perform early prediction and also provides the flexibility to configure the early prediction stages of the gestures even after the model has been trained. This flexibility is desirable as it saves the time and effort of training separate models for different early prediction stages. The proposed framework is generic and works well in both online and offline settings.

We performed extensive quantitative and qualitative experiments on the NVIDIA and Montalbano gesture datasets to demonstrate that our framework is able to recognize the gestures early with high accuracy and also performs simultaneous classification and detection of gestures. The experiments show that the GPM branch models the progression of the gesture and also improves the offline detection accuracy. We outperform the state-of-the-art results in offline gesture detection on the NVIDIA gesture dataset and report competitive results on the Montalbano dataset.

Since our goal is to recognize the gesture category and progression accurately at frame level granularity, start and end point annotations are required. However, the NVIDIA dataset annotations are loosely segmented and also contain background frames. To bridge this gap, we re-annotated the NVIDIA dataset. The new annotations will be made public. With these tight annotations, we also set up a localization baseline over this dataset. In summary, the key contributions of our paper are:

  • A novel multitask framework for online and early gesture recognition. The framework also demonstrates competitive performance on offline gesture classification and localization.

  • A new state-of-the-art result on offline gesture detection on the NVIDIA dataset which is closer to the human accuracy of 88.4% [15].

  • Strongly segmented gesture annotations for the NVIDIA dataset for future research and a new localization baseline on this dataset.

2 Related Work

Classification of dynamic hand gestures has been explored extensively by the research community [2][4][20]. The majority of the approaches today leverage deep 2D/3D convolutional neural networks (CNN) and recurrent units for modelling spatiotemporal information and Support Vector Machine (SVM) or Neural Network (NN) based classifiers for the classification. Miao et al. [13] used a combination of residual and C3D models for extracting features from the multi-modal gesture data. The extracted features of the different modalities are fused with canonical correlation analysis and classified using an SVM. Narayana et al. [16] reported state-of-the-art gesture recognition results on the ChaLearn IsoGD [24] and NVIDIA [15] datasets by introducing multiple spatial channels for each modality. The complete image and the crops corresponding to the left and right hand are treated as separate channels so that the model can focus on the hands along with the complete image. The features of these channels are fused using a sparse network to avoid overfitting and classified using an SVM. While these approaches demonstrate promising results, they are mainly designed for offline gesture classification where the task is to classify the gestures after completion.

Gesture localization approaches that perform frame level classification without processing the whole gesture are more suitable for online gesture classification. Pigou et al. [19] explored frame level gesture classification on the Montalbano dataset [7] using a deep neural network consisting of temporal convolutions and recurrent units. Neverova et al. [17] demonstrated gesture localization by treating classification and localization as two different tasks. The frame level classifications are post processed by a localization module, which is a binary classifier that distinguishes between gesture and no-gesture. Along similar lines, Wang et al. [25] explore a two stage process where they localize the presence of the gesture based on the motion information and then create a depth motion map for the identified segment for classification. However, this two stage process makes the method unsuitable for online gesture classification. While these approaches output predictions for every frame without observing the whole gesture, they do not model the progression of the gesture, which makes it difficult to determine the frame at which the appropriate response should be fired. Due to this limitation, the above methods cannot be directly used for early gesture recognition.

Molchanov et al. [15] studied early gesture detection by using CTC as the loss function to detect the nucleus of the gesture. However, their method does not detect the complete gesture and cannot be used for prediction at any stage other than the nucleus.

Early event detection has been explored in the literature but with limited focus as compared to classification and detection. Hoai et al. [10] used a Structured Output SVM (SOSVM) to identify the events from partial observations. Temporal segments encompassing the event partially or completely are used as positive samples and the remaining segments are used as negative samples to train the SOSVM to distinguish between the background and the event. The authors also constrain the classifier to output a higher confidence score as it observes more of the event. Ma et al. [12] propose a new ranking loss function to encourage the model to become more confident as the activity progresses. The loss function constrains the model to output a monotonically non-decreasing detection score for the correct category, or a monotonically non-decreasing margin between the correct category and the category with the highest probability. Aliakbarian et al. [1] also propose a new loss function which, along with the cross entropy loss, applies a monotonically increasing penalty to the model for incorrect predictions to discourage the model from generating false positives as it observes more of the activity. Although these approaches help the model in predicting the activity at an early stage, they do not model the progression of the activity separately.

Inspired by the above approaches, we propose a novel multitask framework which addresses early and online gesture classification by modelling gesture classification and progression separately into two different branches. Explicit modelling of the gesture progression provides the flexibility to specify the gesture trigger points even after training. We demonstrate that our framework performs better early gesture prediction as well as offline classification and localization.

Figure 2: Architecture of the spatiotemporal encoder. All the 3D convolution kernels are 3 x 3 x 3 with the denoted number of filter maps and are followed by batch normalization. All the pooling layers preserve the temporal dimension and have a kernel size of 1 x 2 x 2. Both the linear layers have 2048 units and the Gated Recurrent Unit (GRU) has 1024 units. ReLU is used as the activation function.

3 Proposed Method

Our architecture consists of three core components: Spatiotemporal Encoder ($E$), Gesture Progression Modelling (GPM) module ($G$) and Classification module ($C$) as shown in Figure 1.

The input to our framework at time $t$ is a stream of frames $(x_1, \dots, x_t)$ and the output is $(\hat{p}_t, \hat{c}_t)$, where $\hat{p}_t \in [0, 1]$.

$\hat{p}_t$ is the GPM prediction for a particular frame at time $t$. $\hat{c}_t$ is the predicted gesture category at time $t$, where $\hat{c}_t \in \{1, \dots, N+1\}$; $N$ is the number of gesture classes and the additional one is for the no-gesture class.

3.1 Spatiotemporal Encoder

The Spatiotemporal encoder forms the backbone of our architecture. The goal of this module is to extract rich spatial and temporal features from the raw input video frames suitable for gesture recognition. The resulting features encode both the appearance and motion information present in the video frames. The encoder $E$ maps the current frame to the spatiotemporal features: $h_t = E(x_t)$, where $x_t$ is the frame at time $t$ and $h_t \in \mathbb{R}^{d}$. $d$ is the dimension of the feature map.

3DCNNs have shown promise in extracting local and short term spatiotemporal features from a sequence of frames [23][21]. Therefore, we leverage a 3DCNN network for extracting features from the raw video frames. Spatial max pooling is used to reduce the spatial resolution, but the features are not pooled temporally in order to maintain frame level granularity. The output of the 3DCNN is connected to two linear layers.

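While 3DCNNs are effective in modelling the short term dependencies, recurrent units like the Gated Recurrent Unit (GRU) [5] have been proposed for capturing the long term dependencies. The output of the 3DCNN network is fed to a GRU. The GRU takes the features from the 3DCNN network and the hidden state representation from the previous time step as the input and outputs the features for the current video frame. We use the following formulation for the GRU:

$$
\begin{aligned}
z_t &= \sigma(W_z f_t + U_z h_{t-1}) \\
r_t &= \sigma(W_r f_t + U_r h_{t-1}) \\
\tilde{h}_t &= \tanh(W_h f_t + U_h (r_t \odot h_{t-1})) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
$$

$f_t$ are the features extracted from the 3DCNN and $h_t$ is the output of the encoder network. $z_t$ and $r_t$ represent the output of the update and reset gates at time $t$ respectively, and $W_z, W_r, W_h, U_z, U_r, U_h$ are the learned parameters. $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication. These features are used as input to the GPM and classification modules as explained in the next sections.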

3.2 Gesture Progression Modelling (GPM)

The Gesture Progression Modelling (GPM) module models the temporal progression of the gestures at frame level granularity. It regresses the feature embedding $h_t$ into the progression value: $\hat{p}_t = G(h_t)$, where $\hat{p}_t \in [0, 1]$. In this work, we use the elapsed duration as a measure of gesture progression. The elapsed duration is normalized by the duration of the gesture to accommodate gestures of varying length. If a gesture starts at frame $t_s$ and ends at frame $t_e$, the GPM value at time $t$ is defined as:

$$
p_t = \frac{t - t_s}{t_e - t_s}, \qquad t_s \le t \le t_e
$$

$p_t$ is set to zero for background frames. This module enables our framework to do reliable early gesture detection as it predicts the completion ratio as the gesture moves towards completion. Our method allows the flexibility to configure different stages of prediction for every gesture. It also provides the option of modifying the gesture trigger points even after the model is trained. This saves gesture re-annotation as well as model retraining effort.
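As an illustration of how such frame level progression targets could be built from start and end annotations, here is a minimal sketch. It assumes 0-based inclusive frame indices and a progression of (t - start) / (end - start) inside the gesture; the function name `progression_targets` and the example clip are hypothetical and not taken from the paper.

```python
def progression_targets(num_frames, start, end):
    """Return one progression label per frame.

    Assumption: progression is (t - start) / (end - start) inside the
    gesture and 0 for background frames; indices are 0-based, inclusive.
    """
    if not 0 <= start < end < num_frames:
        raise ValueError("expected 0 <= start < end < num_frames")
    duration = end - start
    targets = [0.0] * num_frames
    for t in range(start, end + 1):
        targets[t] = (t - start) / duration
    return targets


# Hypothetical 10-frame clip with a gesture spanning frames 2..6.
labels = progression_targets(num_frames=10, start=2, end=6)
print(labels)
```

Because the labels are normalized by the gesture duration, a slow and a fast execution of the same gesture both map to the same 0 to 1 range.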

3.3 Gesture Classification

The objective of the classification module is to identify the category of the gesture. The module is expected to distinguish among the gestures as well as the no-gesture class. To model this, we add an extra category representing the no-gesture class to the existing gesture categories.

The module outputs $\hat{c}_t = C(h_t)$, where $\hat{c}_t \in \{1, \dots, N+1\}$; $N$ is the number of gesture classes and the additional one is for the no-gesture class. We train this module with a weighted cross entropy loss to balance the learning between the gesture classes and the no-gesture class, where the weights are inversely proportional to the number of class samples.

3.4 Loss function

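We jointly train the GPM branch with a mean square loss, $L_{gpm}$, and the classification branch with the weighted cross entropy loss, $L_{cls}$. $L_{gpm}$ is defined as:

$$
L_{gpm} = \frac{1}{T} \sum_{t=1}^{T} (\hat{p}_t - p_t)^2
$$

where $\hat{p}_t$ and $p_t$ are the predicted and ground truth gesture progression values at time $t$ and $T$ is the number of frames. $L_{cls}$ is defined as:

$$
L_{cls} = -\frac{1}{T} \sum_{t=1}^{T} w_t \log(\hat{y}_t)
$$

The weights $w_t$ are inversely proportional to the frequency of the gesture category at time $t$. $\hat{y}_t$ is the predicted probability corresponding to the ground truth category. The final objective for training the network is given by:

$$
L = L_{cls} + \lambda L_{gpm}
$$

where $\lambda$ is the hyper-parameter for weighting the respective losses.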
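The joint objective can be sketched in a few lines of plain Python. This is an illustrative sketch only: it assumes both terms are averaged over frames and combined as cross entropy plus `lam` times the mean square error, and the names `class_weights`, `joint_loss` and `lam`, as well as the toy inputs, are hypothetical.

```python
import math


def class_weights(labels, num_classes):
    """Weights inversely proportional to class frequency (0 for unseen classes)."""
    counts = [0] * num_classes
    for c in labels:
        counts[c] += 1
    return [1.0 / n if n > 0 else 0.0 for n in counts]


def joint_loss(pred_prog, true_prog, pred_probs, true_labels, weights, lam=1.0):
    """Weighted cross entropy plus lam times the mean square progression error.

    Assumption: both terms are averaged over the frames of the clip.
    """
    n = len(true_labels)
    mse = sum((p - g) ** 2 for p, g in zip(pred_prog, true_prog)) / n
    wce = -sum(
        weights[c] * math.log(probs[c]) for probs, c in zip(pred_probs, true_labels)
    ) / n
    return wce + lam * mse


# Toy clip: class 0 is no-gesture, class 1 is a gesture.
true_labels = [0, 0, 0, 1]
weights = class_weights(true_labels, num_classes=2)
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
loss = joint_loss([0.0, 0.1, 0.2, 0.9], [0.0, 0.0, 0.0, 1.0],
                  pred_probs, true_labels, weights)
print(round(loss, 4))
```

The rarer gesture class receives the larger weight, so a single misclassified gesture frame contributes more to the loss than a misclassified background frame.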

Figure 3: The Neo-NVIDIA annotations accurately localize the start and end frames of the gestures. In this figure, we show the annotations for an instance of the "swipe-up" gesture performed by a subject. Unlike the existing annotations, the Neo-NVIDIA annotations are strongly segmented and do not contain background or no-gesture frames.

3.5 Gesture Inference

The inference strategy of our method is based on the level of gesture progression. In the offline setting, the gesture prediction is triggered at the peak of the predicted GPM curve, which also indicates that the gesture has completed. In the online setting, we trigger gesture detection when the GPM output exceeds a predefined threshold. In both settings, the probability vector corresponding to the detected location is used for the classification of the gesture.
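The two trigger rules can be sketched as follows. This is a minimal illustration on per-frame outputs held in Python lists; the function names `detect_online` and `detect_offline` and the toy values are hypothetical.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def detect_online(gpm_outputs, class_probs, threshold):
    """Fire at the first frame whose GPM output exceeds the threshold.

    Returns (frame index, predicted class), or None if it never fires.
    """
    for t, (g, probs) in enumerate(zip(gpm_outputs, class_probs)):
        if g > threshold:
            return t, argmax(probs)
    return None


def detect_offline(gpm_outputs, class_probs):
    """Fire at the peak of the GPM curve over the whole clip."""
    t = argmax(gpm_outputs)
    return t, argmax(class_probs[t])


# Toy per-frame outputs; class 0 is no-gesture.
gpm = [0.0, 0.1, 0.3, 0.6, 0.9, 0.2]
probs = [[0.9, 0.1, 0.0], [0.8, 0.1, 0.1], [0.3, 0.6, 0.1],
         [0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]
early = detect_online(gpm, probs, threshold=0.25)
late = detect_offline(gpm, probs)
print(early, late)
```

Only `threshold` changes between early and late triggering, which is why the trigger point can be reconfigured without retraining.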

4 Implementation Details

4.1 Architecture

The Spatiotemporal encoder consists of a 3DCNN and a GRU network to extract spatiotemporal features from the raw video frames as shown in Figure 2. Conv3D represents a 3D convolution layer with kernel size 3 x 3 x 3 and the denoted number of feature maps, Linear(n) is a fully connected linear layer of n output units and MaxPool3D is a max pooling layer. All the 3D convolution layers are followed by batch normalization [11] to speed up training. All the max pooling layers use a kernel size of 1 x 2 x 2 to retain temporal resolution. ReLU is used as the activation function after every convolutional and linear layer. The output of the 3DCNN is connected to a GRU with 1024 units.

The GPM module consists of a linear layer followed by a sigmoid activation to constrain the outputs between 0 and 1 and is trained with a mean square loss. The gesture classification module consists of a single linear layer with softmax activation and outputs the class probabilities for each frame.

4.2 Training

Each video is subsampled into 80 frames by using nearest neighbour sampling and resized to 160 x 120 spatial dimensions. We crop the frames to 112 x 112 at a random spatial position during training and use the center crop during inference. We use stochastic gradient descent with a learning rate of 0.001 and reduce it by a factor of 10 after every 100 epochs. Momentum of 0.9 and weight decay of 0.005 are used. We tuned the hyper-parameter weighting the losses between the two branches to obtain balanced learning among the branches. To avoid overfitting, we randomly perturb the video frames with spatial rotation, spatial scaling, temporal scaling, non-linear temporal scaling and temporal translation. For every video, the magnitude of each perturbation is sampled from a uniform distribution over a fixed interval. We further avoid overfitting by following every 3D convolution with a volumetric dropout [22] and the linear layers with linear dropout [9]. Volumetric dropout helps the model by promoting independence between feature maps. The dropout probabilities are set at 0.1 for 3DCNN layers and 0.85 for linear layers. We also clip the gradients to the range [-10, 10] to avoid gradient explosion [18]. Our framework is written in Torch7 [6] and trained on NVIDIA Titan X GPUs.
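The temporal subsampling and the inference-time crop can be illustrated with a short sketch. It assumes that nearest neighbour sampling aligns the first and last frames of the source and target sequences, which is one common convention and not necessarily the one used here; the function names are hypothetical.

```python
def nearest_neighbour_indices(num_source_frames, num_target_frames=80):
    """Source frame indices picked by nearest neighbour temporal resampling.

    Assumption: the first and last frames of source and target are aligned.
    """
    if num_source_frames < 1 or num_target_frames < 1:
        raise ValueError("frame counts must be positive")
    if num_target_frames == 1:
        return [0]
    scale = (num_source_frames - 1) / (num_target_frames - 1)
    return [int(round(i * scale)) for i in range(num_target_frames)]


def center_crop_offsets(width=160, height=120, crop=112):
    """Top-left corner (x, y) of the center crop used at inference."""
    return (width - crop) // 2, (height - crop) // 2


indices = nearest_neighbour_indices(160)   # long clip: frames are skipped
repeated = nearest_neighbour_indices(40)   # short clip: frames are repeated
offsets = center_crop_offsets()
print(len(indices), indices[:4], offsets)
```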

We trained our model on depth modality first and used it to initialize weights for other modalities after appropriate inflation. For example, we inflated the first layer from 1 channel to 3 channels for RGB modality and 1 channel to 2 channels for optical flow. While inflating to color, we divide the weights by 3 and for flow we divide them by 2 for normalizing the activations.
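The effect of dividing the inflated weights by the number of channels can be checked with a small sketch. It treats one first-layer filter as a flat list of weights; `inflate_input_channels`, `response` and the toy numbers are hypothetical illustrations, not the Torch7 code used for the experiments.

```python
def inflate_input_channels(single_channel_kernel, num_channels):
    """Copy a 1-channel kernel to num_channels channels, dividing by num_channels."""
    return [[w / num_channels for w in single_channel_kernel]
            for _ in range(num_channels)]


def response(kernel, patch):
    """Filter response: sum over channels of the per-channel dot product."""
    return sum(w * x
               for k_c, p_c in zip(kernel, patch)
               for w, x in zip(k_c, p_c))


depth_kernel = [0.3, -0.6, 0.9]   # toy weights of one depth filter
patch = [1.0, 2.0, 3.0]           # toy input patch

depth_response = response([depth_kernel], [patch])
rgb_kernel = inflate_input_channels(depth_kernel, 3)
# If all three color channels carried the same values as the depth patch,
# the inflated filter would produce the same activation as the original.
rgb_response = response(rgb_kernel, [patch, patch, patch])
print(depth_response, rgb_response)
```

Without the division, the same input would produce an activation three times larger for color, which would shift the statistics seen by the layers that follow.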

5 NVIDIA Dataset

5.1 Dataset

The NVIDIA gesture dataset consists of dynamic hand gestures collected in a car simulator under different lighting conditions. The gesture categories are focused on designing human computer interaction (HCI) interfaces, which makes it an important benchmark for online gesture analysis. The dataset consists of 25 dynamic gesture classes like hand and finger swipes in different directions, pointing with the index finger, moving two fingers in the clockwise or anti-clockwise direction etc. A total of 20 subjects participated in the data collection campaign, resulting in a dataset of 1532 video samples. The dataset has multiple modalities (color, depth and a pair of IR streams) and is split into 1050 training videos and 482 test videos. We perform extensive experiments on this dataset and compare our framework with the baseline and current state-of-the-art on this dataset [15].

5.2 Neo-NVIDIA Annotations

The videos of the NVIDIA dataset are weakly annotated, as the annotated start and end frames also contain background or no-gesture frames. This makes the annotations unsuitable for our approach and for gesture localization tasks in general. To overcome this, we annotated the NVIDIA gesture dataset with exact gesture start and end boundaries. The frame at which the subject begins to execute the gesture is marked as the starting frame and the frame at which the gesture is completed is marked as the end frame, as shown in Figure 3. A team of experienced annotators annotated the dataset by observing the depth and color videos. Every video was annotated and reviewed by multiple annotators to maintain the quality of annotation. We will release these new annotations for advancing the research in this domain.

Modality FPR@0.25 TPR@0.25 FPR@0.50 TPR@0.50 FPR@0.75 TPR@0.75
Depth 6.9 89.5 1.7 49.5 0.3 12.5
Color 11.3 91.2 1.8 43.6 0.3 7.9
Flow 11.1 92.5 1.9 45.6 0.2 7.1
IR 11.6 84.3 1.4 33.7 0.2 6.1
Table 1: True Positive Rate (TPR) and False Positive Rate (FPR) across different Normalized Time to Detect (NTtD) values on the NVIDIA dataset for different modalities. Columns are labelled as metric@NTtD.
Modality Depth Flow Color IR Fusion
AUC 95.1 94.3 92.9 89.3 95.2
Table 2: Area under the Curve (AUC) of the Receiver Operating Characteristics (ROC) curve on different modalities on the NVIDIA dataset. "Fusion" represents AUC after fusing all the modalities.
Figure 4: Normalized Time to Detect (NTtD) vs False Positive Rate (FPR) on the NVIDIA dataset for depth, color, flow and IR modality.

6 NVIDIA Dataset Experiments

6.1 Early and Online Gesture Recognition

In the online setting, the framework processes an incoming stream of frames and outputs the classification and progression predictions for each frame. We select a detection threshold value and when the GPM output exceeds the selected threshold, the class probability scores of the corresponding frame are used to determine the predicted gesture class. We compute the Normalized Time To Detect (NTtD) [10] to measure the performance of our system for early prediction. NTtD is defined as the fraction of the event duration that the detector observes before the event prediction. We report the False Positive Rate (FPR) and True Positive Rate (TPR) across different mean NTtD values for the correctly recognized gestures. TPR is defined as the ratio of correctly predicted gesture frames to the total gesture frames and FPR is the ratio of incorrectly classified gesture frames to the no-gesture frames.
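As a concrete illustration of NTtD, the sketch below computes it for a single gesture. It assumes inclusive frame indices and counts the detection frame itself as observed, so that firing on the last gesture frame gives 1.0; the function name and the example numbers are hypothetical.

```python
def normalized_time_to_detect(start, end, detection_frame):
    """Fraction of the gesture [start, end] observed when the detector fires.

    Assumption: inclusive frame indices; the detection frame counts as observed.
    """
    if not start <= detection_frame <= end:
        raise ValueError("detection must fall inside the gesture")
    return (detection_frame - start + 1) / (end - start + 1)


# A 20-frame gesture spanning frames 10..29, detected at frame 14.
nttd = normalized_time_to_detect(start=10, end=29, detection_frame=14)
print(nttd)
```

A lower value means an earlier trigger; averaging this quantity over the correctly recognized gestures gives the mean NTtD for a given detection threshold.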

Our framework is able to recognize gestures by processing only 25% of their duration with a low FPR and high TPR as shown in Table 1. In Figure 4, we plot the detailed FPR vs NTtD characteristics and observe that the FPR decreases as the NTtD increases, which is expected as the model becomes more confident on observing longer durations of the gesture.

To analyze the detection performance of our system across different detection thresholds, we also plot the receiver operating characteristic (ROC) curve [10] of TPR and FPR at different threshold values and report the area under the curve (AUC) for different modalities and their fusion in Table 2.

Modality Ours Molchanov et al.
IR 68.7 63.5
Color 75.9 74.1
Flow 78.2 77.8
Depth 85.5 80.3
IR Disparity (ID) - 57.8
Flow + Color 80.3 79.3
Depth + Flow 85.5 82.4
Depth + Color 86.1 -
Depth + Color + Flow 86.3 81.5
Depth + Color + Flow + IR 87.8 83.4
Depth + Color + Flow + IR + ID - 83.8
Human Accuracy 88.4
Table 3: Comparison of Offline Classification accuracy (%) of the proposed method with [15] for different modalities and their fusion on the NVIDIA gesture dataset. [15] report a human accuracy of 88.4% on this dataset.
Figure 5: Plot of the model predictions vs time for gestures in the test set. The first two rows correspond to the Gesture Progression Module and the last two are for the classification branch. The second peak is an example of a failure case in which the GPM and classification module fail to model the progression of the gesture.

6.2 Offline Gesture Recognition

In Table 3, we compare the offline performance of our method with [15] for different modalities and combinations. Our method achieves state-of-the-art accuracy on the NVIDIA dataset and further approaches human level accuracy. Fig. 5 depicts the ground truths and predictions of the GPM and classification module. We can observe that both the GPM and classification modules trigger at similar time frames for the successful cases and fail to align in the failure cases (second peak in the plot). In our experiments, we observe that depth modality outperforms other modalities, which can be explained by the fact that depth data is less sensitive to ambient conditions like lighting, background noise, etc. We use a simple weighted average strategy over the conditional probabilities to combine the predictions of different input modalities. The ensemble weights were estimated through a linear classifier trained on training data.
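The weighted average fusion can be sketched as follows. The modality weights in this example are made-up values for illustration, not the learned ensemble weights, and the function name `fuse` is hypothetical.

```python
def fuse(probabilities_by_modality, weights):
    """Weighted average of per-modality class probability vectors."""
    total = sum(weights[name] for name in probabilities_by_modality)
    num_classes = len(next(iter(probabilities_by_modality.values())))
    fused = [0.0] * num_classes
    for name, probs in probabilities_by_modality.items():
        for k, p in enumerate(probs):
            fused[k] += weights[name] * p / total
    return fused


# Toy class probabilities from two modalities that disagree.
predictions = {"depth": [0.6, 0.3, 0.1], "color": [0.2, 0.7, 0.1]}
weights = {"depth": 0.75, "color": 0.25}   # hypothetical weights
fused = fuse(predictions, weights)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted_class)
```

With these weights the more heavily weighted depth prediction wins even though color is more confident about a different class.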

Modality Jaccard Index
Depth 0.60
Flow 0.54
Color 0.53
IR 0.47
Depth + Color + Flow + IR 0.61
Table 4: Localization results on the NVIDIA dataset. The Jaccard index indicates the mean overlap between predictions and the ground truth across gesture categories.

6.3 Gesture Localization

Our tightly segmented Neo-NVIDIA annotations also allow us to perform gesture localization on this dataset. We provide a benchmark localization performance in Table 4. To the best of our knowledge, we are the first to do this on the NVIDIA dataset. We compute the Intersection over Union (IoU) of the gesture detections and the ground truth and report the mean Jaccard Index. The Jaccard Index is a standard metric for the localization task and has been used by [19][15].
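For a single detection, the temporal IoU can be computed as in the sketch below, which assumes that segments are given as inclusive (start, end) frame intervals; the function name `temporal_iou` and the example intervals are hypothetical.

```python
def temporal_iou(pred, truth):
    """Intersection over Union of two inclusive (start, end) frame intervals."""
    intersection = max(0, min(pred[1], truth[1]) - max(pred[0], truth[0]) + 1)
    union = (pred[1] - pred[0] + 1) + (truth[1] - truth[0] + 1) - intersection
    return intersection / union


# Predicted segment 10..29 against ground truth 20..39: 10 shared frames
# out of 30 frames covered in total.
iou = temporal_iou((10, 29), (20, 39))
print(iou)
```

Averaging this value over detections and gesture categories gives a mean Jaccard Index of the kind reported in Table 4.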

6.4 Ablation Studies

6.4.1 Spatiotemporal Encoder Architecture

We evaluate the contribution of 3D convolutions and recurrent units in Table 5 and observe a deterioration in performance when a linear aggregator of 3DCNN features is used instead of GRU based recurrent units. This is expected since the linear network cannot model long term temporal information. We further evaluate an architecture in which we use a 2DCNN as the feature extractor and model temporal information using a GRU network, and observe a decrease in the classification accuracy. From this analysis, we conclude that both the 3DCNN and GRU components are independently crucial to the state-of-the-art performance of our network.

Architecture 2DCNN-GRU 3DCNN-Linear 3DCNN-GRU
Acc (%) 77.4 81.5 85.5
Table 5:  Offline Classification accuracy(%) of our approach under different architecture settings of the Spatiotemporal Encoder. Results are reported on depth modality.

6.4.2 Gesture Progression Modeling

In Table 6, we study the efficacy of the GPM in detecting the gesture correctly and at the correct location. We define $S$ as the consensus set of frames which participate in voting for the final category prediction. In the baseline setting, the GPM branch is not used and global voting is done [19]. In this setting, the consensus set is $S = \{1, \dots, T\}$, where $T$ is the number of frames in the video. In the next settings, we use the GPM branch to choose the consensus set for classification at different thresholds. Formally, the consensus set is $S = \{t : \hat{p}_t \geq \alpha \, \hat{p}_{max}\}$, where $\hat{p}_t$ is the GPM prediction at frame $t$, $\alpha$ is our ratio and $\hat{p}_{max} = \max_t \{\hat{p}_t\}$ is the maximum gesture progression level predicted by the GPM. We include results with various $\alpha$ values in Table 6. We observe the best accuracy when using the 100% progression level, which is identical to selecting the frame with the maximum progression value. This analysis shows that the GPM branch is able to accurately predict the completion of the gesture within the gesture ground truth. An explanation for the inferior performance of simple voting is that it cannot handle the false positives caused by unintentional hand movements. In noisy videos, such false positives can dominate the actual gesture frames. GPM solves this problem by allowing the model to focus on relevant gesture frames.
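The difference between global voting and GPM-guided voting can be illustrated with a small sketch. It assumes that voting means summing the class probabilities over the consensus frames and that a frame joins the consensus set when its GPM output is at least `ratio` times the maximum; both choices, the function name `consensus_vote` and the toy values are hypothetical.

```python
def consensus_vote(gpm_outputs, class_probs, ratio=None):
    """Predict a class by summing probabilities over the consensus frames.

    ratio=None uses every frame (global voting); otherwise only frames whose
    GPM output is at least ratio * max(gpm_outputs) take part.
    """
    if ratio is None:
        frames = list(range(len(gpm_outputs)))
    else:
        peak = max(gpm_outputs)
        frames = [t for t, g in enumerate(gpm_outputs) if g >= ratio * peak]
    num_classes = len(class_probs[0])
    scores = [sum(class_probs[t][k] for t in frames) for k in range(num_classes)]
    return max(range(num_classes), key=scores.__getitem__)


# Toy clip: five noisy frames favour class 0, the two frames near the end of
# the actual gesture favour class 1.
gpm = [0.05, 0.1, 0.1, 0.1, 0.1, 0.8, 1.0]
probs = [[0.8, 0.2]] * 5 + [[0.1, 0.9]] * 2
global_prediction = consensus_vote(gpm, probs)
gpm_prediction = consensus_vote(gpm, probs, ratio=0.75)
print(global_prediction, gpm_prediction)
```

In this toy clip the noisy frames outvote the gesture under global voting, while restricting the vote to frames with high predicted progression recovers the gesture class.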

Threshold Depth Color
Baseline (Global Voting) 84.2 74.7
GPM @ 75% 84.7 75.9
GPM @ 85% 84.9 75.5
GPM @ 95% 84.9 75.5
GPM @ 100% 85.5 75.9
Table 6:  Offline Classification accuracy (%) at different threshold ratios to the maximum GPM prediction in the video for depth and color modality. In Baseline setting, the GPM branch is not used and global voting is performed for classification.
Approach Modality Acc
[19] Color+Depth+Skeleton 97.2
[15] Color+Depth+Flow 98.2
Ours Depth 95.3
Ours Color 96.8
Ours Flow 94.6
Ours Color+Depth+Flow 97.7
Table 7: Results on Montalbano dataset. Comparison of Offline Classification accuracy (%) of the proposed method with the state-of-the-art on pre-segmented videos for different modalities and fusion.

7 Montalbano Dataset Experiments

7.1 Dataset

The Montalbano dataset [7] is a large dataset of around 14K gestures belonging to 20 categories and performed by 27 subjects under varying conditions. The videos were collected using Microsoft Kinect and have color, depth and skeletal information. Multiple gestures are present in each video and for each gesture, along with the gesture category, the start and end frame have also been annotated. We conduct experiments on the Montalbano dataset to comprehensively compare our method with the early gesture detection baseline [15] and demonstrate that our method achieves competitive results on this dataset also.

Modality Jaccard Index
Depth 0.89
Flow 0.87
Color 0.90
Depth + Color + Flow 0.91
Table 8: Localization results on the Montalbano dataset. The Jaccard index indicates the mean overlap between predictions and the ground truth across gesture categories.

7.2 Experimental Results

For early gesture recognition, we report a TPR and FPR of 83% and 5.6% respectively for an NTtD of 20% on the color modality. For measuring the offline gesture recognition performance, we compare our results with [19][15] in Table 7 across different modalities and achieve competitive results. We also measure the localization performance of our method and report the Jaccard Index in Table 8 for all the modalities. We achieve a Jaccard Index of 0.91, which is comparable to the result reported by [19].

8 Conclusion

Early and online detection of gestures is important for designing responsive and real time gesture based interfaces. In this work, we proposed a multitask learning framework that models the progression of the gesture (GPM) along with frame level classification for performing early gesture detection. The proposed framework works well in both online and offline settings. In the online setting, our method is able to detect gestures before completion with a high True Positive Rate (TPR) and a low False Positive Rate (FPR). For offline gesture detection, we outperform the state-of-the-art accuracy on the NVIDIA dataset and report competitive results on the Montalbano dataset. To further the research, we contribute a new set of tightly segmented annotations for the NVIDIA dataset and set up a new localization baseline.

Acknowledgement: We would like to thank Shuaib Ahmed, Mallikarjun BR, Neha Tarigopula and Sanath Narayan for their valuable feedback and discussions. We gratefully acknowledge Brijesh Pillai and Partha Bhattacharya at Mercedes-Benz Research and Development India, Bangalore for providing the funding and infrastructure for this work.

References


  • [1] M. S. Aliakbarian, F. S. Saleh, M. Salzmann, B. Fernando, L. Petersson, and L. Andersson. Encouraging lstms to anticipate actions very early. In 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [2] M. Asadi-Aghbolaghi, A. Clapes, M. Bellantonio, H. J. Escalante, V. Ponce-López, X. Baró, I. Guyon, S. Kasaei, and S. Escalera. A survey on deep learning based approaches for action and gesture recognition in image sequences. In 2017 12th IEEE international conference on automatic face & gesture recognition (FG 2017), pages 476–483. IEEE, 2017.
  • [3] S. K. Card, G. G. Robertson, and J. D. Mackinlay. The information visualizer, an information workspace. In Proceedings of the SIGCHI Conference on Human factors in computing systems, pages 181–186. ACM, 1991.
  • [4] H. Cheng, L. Yang, and Z. Liu. Survey on 3d hand gesture recognition. IEEE Transactions on Circuits and Systems for Video Technology, 26(9):1659–1673, 2016.
  • [5] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
  • [6] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In Proceedings of the NIPS, 2011.
  • [7] S. Escalera, X. Baró, J. Gonzalez, M. A. Bautista, M. Madadi, M. Reyes, V. Ponce-López, H. J. Escalante, J. Shotton, and I. Guyon. Chalearn looking at people challenge 2014: Dataset and results. In ECCV, 2014.
  • [8] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376. ACM, 2006.
  • [9] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  • [10] M. Hoai and F. De la Torre. Max-margin early event detectors. International Journal of Computer Vision, 107(2):191–202, 2014.
  • [11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [12] S. Ma, L. Sigal, and S. Sclaroff. Learning activity progression in lstms for activity detection and early detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1942–1950, 2016.
  • [13] Q. Miao, Y. Li, W. Ouyang, Z. Ma, X. Xu, W. Shi, and X. Cao. Multimodal gesture recognition based on the resc3d network. In Proceedings of the IEEE International Conference on Computer Vision, pages 3047–3055, 2017.
  • [14] S. Mitra and T. Acharya. Gesture recognition: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 37(3):311–324, 2007.
  • [15] P. Molchanov, X. Yang, S. Gupta, K. Kim, S. Tyree, and J. Kautz. Online detection and classification of dynamic hand gestures with recurrent 3d convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4207–4215, 2016.
  • [16] P. Narayana, R. Beveridge, and B. A. Draper. Gesture recognition: Focus on the hands. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5235–5244, 2018.
  • [17] N. Neverova, C. Wolf, G. Taylor, and F. Nebout. Moddrop: adaptive multi-modal gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1692–1706, 2016.
  • [18] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318, 2013.
  • [19] L. Pigou, A. Van Den Oord, S. Dieleman, M. Van Herreweghe, and J. Dambre. Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video. International Journal of Computer Vision, 126(2-4):430–439, 2018.
  • [20] S. S. Rautaray and A. Agrawal. Vision based hand gesture recognition for human computer interaction: a survey. Artificial Intelligence Review, 43(1):1–54, 2015.
  • [21] Z. Shou, D. Wang, and S.-F. Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1049–1058, 2016.
  • [22] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 648–656, 2015.
  • [23] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
  • [24] J. Wan, Y. Zhao, S. Zhou, I. Guyon, S. Escalera, and S. Z. Li. Chalearn looking at people rgb-d isolated and continuous datasets for gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 56–64, 2016.
  • [25] P. Wang, W. Li, S. Liu, Y. Zhang, Z. Gao, and P. Ogunbona. Large-scale continuous gesture recognition using convolutional neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 13–18. IEEE, 2016.