A Multi-Modal Transformer Network for Action Detection

05/31/2023
by   Matthew Korban, et al.

This paper proposes a novel multi-modal transformer network for detecting actions in untrimmed videos. To enrich the action features, the network uses a new multi-modal attention mechanism that computes the correlations between different combinations of spatial and motion modalities. Exploring such correlations for actions has not been attempted previously. To use the motion and spatial modalities more effectively, we propose an algorithm that corrects the motion distortion caused by camera movement. Such distortion, common in untrimmed videos, severely reduces the expressive power of motion features such as optical flow fields. Our proposed algorithm outperforms state-of-the-art methods on two public benchmarks, THUMOS14 and ActivityNet. We also conducted comparative experiments on our new instructional activity dataset, which includes a large set of challenging classroom videos captured in elementary schools.
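The abstract does not spell out either mechanism, so the following is only a minimal sketch of the two general ideas, not the authors' method. It assumes plain scaled dot-product cross-attention (queries from one modality, keys and values from the other) as a stand-in for the multi-modal attention, and median-flow subtraction as a crude stand-in for camera-motion correction. The function names `cross_modal_attention` and `compensate_camera_motion` and all toy values are hypothetical.

```python
import math
from statistics import median


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys_values):
    """Scaled dot-product attention across two modalities (assumed form).

    queries:     feature vectors from one modality, e.g. spatial features
    keys_values: feature vectors from the other, e.g. motion features
    Returns the attended features and the attention weights.
    """
    d = len(queries[0])
    out_dim = len(keys_values[0])
    fused, weights = [], []
    for q in queries:
        # Correlation of this query with every vector of the other modality.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the other modality's vectors.
        fused.append([sum(wi * kv[j] for wi, kv in zip(w, keys_values))
                      for j in range(out_dim)])
    return fused, weights


def compensate_camera_motion(flow):
    """Subtract the per-frame median flow vector (assumed global-motion model).

    flow: list of (dx, dy) optical-flow vectors for one frame.
    The median approximates camera-induced motion when most pixels
    belong to the static background.
    """
    gx = median(dx for dx, _ in flow)
    gy = median(dy for _, dy in flow)
    return [(dx - gx, dy - gy) for dx, dy in flow]


# Toy inputs: two time steps with 2-D features per modality.
spatial = [[1.0, 0.0], [0.0, 1.0]]
motion = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = cross_modal_attention(spatial, motion)

# Six background vectors moved only by the camera, one moved by an actor.
flow = [(2.0, -1.0)] * 6 + [(5.0, -1.0)]
residual = compensate_camera_motion(flow)
```

In this toy setup, each row of `weights` sums to 1, and `residual` is zero for the background vectors while the actor's vector keeps only its own motion. A real system would estimate camera motion with a richer model than a single median vector.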


