MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition

08/20/2021
by Jiawei Chen, et al.

This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition. Unlike other schemes, which rely solely on decoded RGB frames, MM-ViT operates exclusively in the compressed video domain and exploits all readily available modalities, i.e., I-frames, motion vectors, residuals, and the audio waveform. To handle the large number of spatiotemporal tokens extracted from multiple modalities, we develop several scalable model variants that factorize self-attention across the space, time, and modality dimensions. In addition, to further explore the rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer building block. Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms state-of-the-art video transformers in both efficiency and accuracy, and performs on par with or better than state-of-the-art CNN counterparts that use computationally heavy optical flow.
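The abstract does not spell out the factorized variants in detail. As a rough, hypothetical illustration of the idea, the sketch below applies single-head self-attention along one axis at a time (space, then time, then modality) over a token grid; learned projections, multi-head structure, and normalization are omitted for brevity, and the axis ordering is an assumption, not the paper's specification.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x, axis):
    # Self-attention restricted to a single axis: move that axis into
    # the sequence position, attend, then move it back.
    # (Q/K/V projections omitted; this is an illustrative sketch.)
    x = np.moveaxis(x, axis, -2)                     # (..., N_axis, D)
    scores = x @ np.swapaxes(x, -1, -2)              # (..., N_axis, N_axis)
    scores /= np.sqrt(x.shape[-1])
    out = softmax(scores, axis=-1) @ x               # (..., N_axis, D)
    return np.moveaxis(out, -2, axis)

def factorized_block(x):
    # x: (T, S, M, D) = (time, space, modality, channels).
    # Attend over each dimension in turn, with residual connections,
    # instead of full joint attention over all T*S*M tokens.
    for axis in (1, 0, 2):  # space, time, modality (assumed order)
        x = x + axis_attention(x, axis)
    return x

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 16, 3, 8))  # 4 frames, 16 patches, 3 modalities, 8 channels
out = factorized_block(tokens)
print(out.shape)  # (4, 16, 3, 8)
```

The payoff of factorization is scalability: joint attention over all T·S·M tokens costs O((T·S·M)²) score computations, whereas attending per-axis costs roughly O(T·S·M·(T+S+M)), which is what makes the large multi-modal token count tractable.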

