Multi-modal Fusion for Single-Stage Continuous Gesture Recognition

11/10/2020
by Harshala Gammulle, et al.

Gesture recognition is a much-studied research area with myriad real-world applications, including robotics and human-machine interaction. Current gesture recognition methods have focused heavily on isolated gestures, and existing continuous gesture recognition methods are limited by a two-stage approach in which independent models are required for detection and classification, with the performance of the latter constrained by detection performance. In contrast, we introduce a single-stage continuous gesture recognition model that can detect and classify multiple gestures in a single video via a single model. This approach learns the natural transitions between gestures and non-gestures without the need for a pre-processing segmentation stage to detect individual gestures. To enable this, we introduce a multi-modal fusion mechanism that supports the integration of important information flowing from multi-modal inputs and is scalable to any number of modes. Additionally, we propose Unimodal Feature Mapping (UFM) and Multi-modal Feature Mapping (MFM) models to map uni-modal features and the fused multi-modal features, respectively. To further enhance performance, we propose a mid-point based loss function that encourages smooth alignment between the ground truth and the prediction. We demonstrate the utility of our proposed framework, which can handle variable-length input videos and outperforms the state of the art on two challenging datasets, EgoGesture and IPN Hand. Furthermore, ablative experiments show the importance of the different components of the proposed framework.
