Efficient Spatialtemporal Context Modeling for Action Recognition

03/20/2021
by Congqi Cao, et al.

Contextual information plays an important role in action recognition. Local operations have difficulty modeling the relation between two elements separated by a long distance. However, directly modeling the contextual information between any two points incurs a huge cost in computation and memory, especially for action recognition, where there is an additional temporal dimension. Inspired by the 2D criss-cross attention used in segmentation tasks, we propose a recurrent 3D criss-cross attention (RCCA-3D) module to model dense long-range spatiotemporal contextual information in video for action recognition. The global context is factorized into sparse relation maps. At each step, we model the relationship between points on the same line along the horizontal, vertical, and depth directions, which forms a 3D criss-cross structure, and we repeat the same operation with a recurrent mechanism to propagate the relation between points from a line to a plane and finally to the whole spatiotemporal space. Compared with the non-local method, the proposed RCCA-3D module reduces the number of parameters and FLOPs by 25 context modeling. We evaluate RCCA-3D with two recent action recognition networks on three datasets and thoroughly analyze the architecture, obtaining the best way to factorize and fuse the relation maps. Comparisons with other state-of-the-art methods demonstrate the effectiveness and efficiency of our model.
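
The line-to-plane-to-space propagation described in the abstract can be illustrated with a small toy model. The sketch below is not the RCCA-3D module itself: it ignores features and attention weights and only tracks which positions can exchange information after each recurrent pass. The grid size `shape` and the helper names `criss_cross_neighbors` and `reachable_after` are hypothetical and chosen for illustration.

```python
from itertools import product


def criss_cross_neighbors(p, shape):
    """Positions on the same line as p along the T, H, or W axis (p included)."""
    t, h, w = p
    T, H, W = shape
    nbrs = {(i, h, w) for i in range(T)}   # depth (temporal) line
    nbrs |= {(t, j, w) for j in range(H)}  # vertical line
    nbrs |= {(t, h, k) for k in range(W)}  # horizontal line
    return nbrs


def reachable_after(p, shape, steps):
    """Positions connected to p after `steps` criss-cross passes."""
    reached = {p}
    for _ in range(steps):
        reached = set().union(*(criss_cross_neighbors(q, shape) for q in reached))
    return reached


shape = (4, 5, 6)  # hypothetical (T, H, W) feature-map size
start = (0, 0, 0)
all_points = set(product(*(range(n) for n in shape)))

# Number of positions connected to `start` after 1, 2, and 3 passes.
coverage = [len(reachable_after(start, shape, r)) for r in range(1, 4)]

# Relations per point in one pass: sparse criss-cross vs. dense pairwise.
per_point_sparse = sum(shape) - 2  # three lines that share the point itself
per_point_dense = len(all_points)

print(coverage, per_point_sparse, per_point_dense)  # [13, 60, 120] 13 120
```

In this toy grid, one pass connects a point to its three lines (13 positions), a second pass extends that to the planes through the point (60 positions), and a third covers all 120 positions, while each pass relates a point to only T + H + W - 2 positions rather than T × H × W. These counts describe connectivity only and are not the parameter or FLOP figures reported by the authors.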

research
06/09/2020

PNL: Efficient Long-Range Dependencies Extraction with Pyramid Non-Local Module for Action Recognition

Long-range spatiotemporal dependencies capturing plays an essential role...
research
11/28/2018

CCNet: Criss-Cross Attention for Semantic Segmentation

Long-range dependencies can capture useful contextual information to ben...
research
04/03/2020

TEA: Temporal Excitation and Aggregation for Action Recognition

Temporal modeling is key for action recognition in videos. It normally c...
research
06/13/2018

How Predictable is Your State? Leveraging Lexical and Contextual Information for Predicting Legislative Floor Action at the State Level

Modeling U.S. Congressional legislation and roll-call votes has received...
research
11/21/2019

TEINet: Towards an Efficient Architecture for Video Recognition

Efficiency is an important issue in designing video architectures for ac...
research
12/12/2022

Cross-Modal Learning with 3D Deformable Attention for Action Recognition

An important challenge in vision-based action recognition is the embeddi...
research
10/01/2019

Action Anticipation for Collaborative Environments: The Impact of Contextual Information and Uncertainty-Based Prediction

For effectively interacting with humans in collaborative environments, m...
