A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion Recognition

by Benjia Zhou, et al.

Motion recognition is a promising direction in computer vision, but training video classification models is much harder than training image models because of insufficient data and large parameter counts. To get around this, some works explore multimodal cues from RGB-D data. Although they improve motion recognition to some extent, these methods remain sub-optimal in the following aspects: (i) data augmentation, i.e., the scale of RGB-D datasets is still limited, and few efforts have been made to explore novel data augmentation strategies for videos; (ii) optimization mechanism, i.e., the tightly space-time-entangled network structure makes spatiotemporal information modeling more challenging; and (iii) cross-modal knowledge fusion, i.e., the high similarity between multimodal representations leads to insufficient late fusion. To alleviate these drawbacks, we propose to improve RGB-D-based motion recognition from both the data and the algorithm perspectives. First, we introduce a novel video data augmentation method dubbed ShuffleMix, which acts as a supplement to MixUp and provides additional temporal regularization for motion recognition. Second, we propose a Unified Multimodal De-coupling and multi-stage Re-coupling framework, termed UMDR, for video representation learning. Finally, we explore a novel cross-modal Complement Feature Catcher (CFCer) that mines potential common features in multimodal information as an auxiliary fusion stream to improve the late fusion results. The seamless combination of these designs forms a robust spatiotemporal representation and achieves better performance than state-of-the-art methods on four public motion datasets. Specifically, UMDR achieves an unprecedented improvement of +4.5 on the IsoGD dataset. Our code is available at https://github.com/zhoubenjia/MotionRGBD-PAMI.
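
To make the data augmentation idea concrete, the sketch below shows standard MixUp applied to video clips, followed by a purely hypothetical frame-level mixing variant. The abstract does not specify how ShuffleMix works, so `temporal_frame_mix` is an illustrative assumption and not the authors' method; the function names, the `(frames, height, width, channels)` clip layout, and the one-hot label format are likewise assumptions made for this example.

```python
import numpy as np


def mixup(clip_a, clip_b, label_a, label_b, lam):
    """Standard MixUp: convex combination of two clips and their labels.

    Clips are arrays of shape (frames, height, width, channels);
    labels are one-hot vectors; lam is a mixing weight in [0, 1].
    """
    mixed_clip = lam * clip_a + (1.0 - lam) * clip_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_clip, mixed_label


def temporal_frame_mix(clip_a, clip_b, label_a, label_b, lam, rng):
    """Hypothetical temporal variant (NOT the paper's ShuffleMix).

    Replaces a random subset of frames in clip_a with the frames of
    clip_b at the same positions, and weights the labels by the
    fraction of frames kept from each clip.
    """
    num_frames = clip_a.shape[0]
    num_from_b = int(round((1.0 - lam) * num_frames))
    # Pick distinct frame positions to overwrite with clip_b's frames.
    idx = rng.choice(num_frames, size=num_from_b, replace=False)
    mixed_clip = clip_a.copy()
    mixed_clip[idx] = clip_b[idx]
    frac_a = 1.0 - num_from_b / num_frames
    mixed_label = frac_a * label_a + (1.0 - frac_a) * label_b
    return mixed_clip, mixed_label
```

In practice the mixing weight for MixUp is usually drawn from a Beta distribution for each batch; it is passed in explicitly here so that the behaviour is easy to inspect.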


