TSI: Temporal Saliency Integration for Video Action Recognition

by Haisheng Su, et al.

Efficient spatiotemporal modeling is an important yet challenging problem for video action recognition. Existing state-of-the-art methods exploit motion cues to assist short-term temporal modeling by taking temporal differences over consecutive frames. However, camera movement inevitably introduces background noise into these differences, and the motion of different actions can vary greatly. In this paper, we propose a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module. Specifically, SME aims to highlight the motion-sensitive area through local-global motion modeling, where background suppression and pyramidal feature difference are applied successively between neighboring frames to capture motion dynamics with less background noise. CTI is designed to perform multi-scale temporal modeling through a group of separate 1D convolutions, and temporal interactions across different scales are integrated with an attention mechanism. Through these two modules, both long-term and short-term temporal relationships can be encoded efficiently while introducing only a limited number of additional parameters. Extensive experiments on several popular benchmarks (i.e., Something-Something v1 and v2, Kinetics-400, UCF-101, and HMDB-51) demonstrate the effectiveness and superiority of the proposed method.
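The two ideas named in the abstract, a motion gate derived from neighboring-frame differences and an attention-weighted mix of temporal convolutions at several scales, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`motion_excitation`, `cross_scale_integration`) are hypothetical, the learned convolutions are replaced by fixed averaging kernels, the attention scores are a simple global average, and background suppression and the pyramidal difference are omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def motion_excitation(feats):
    """Re-weight features by a gate built from neighboring-frame differences.

    feats: array of shape (T, C, H, W). Returns an array of the same shape.
    Assumption: no learned layers; the gate is a sigmoid of the pooled
    absolute difference, applied as a residual (x + x * gate).
    """
    diff = np.zeros_like(feats)
    diff[:-1] = feats[1:] - feats[:-1]  # last frame has no successor: zero
    pooled = np.abs(diff).mean(axis=(2, 3), keepdims=True)  # (T, C, 1, 1)
    gate = sigmoid(pooled)
    return feats + feats * gate


def temporal_conv1d(x, kernel):
    """1D convolution along time with zero 'same' padding.

    x: (T, C); kernel: (K,) with odd K, shared across channels.
    """
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad the time axis only
    return np.stack(
        [(xp[t:t + k] * kernel[:, None]).sum(axis=0) for t in range(x.shape[0])]
    )


def cross_scale_integration(x, kernel_sizes=(1, 3, 5)):
    """Mix temporal branches of different kernel sizes with softmax weights.

    x: (T, C) per-frame descriptors. Returns (mixed, weights), where mixed
    has shape (T, C) and weights has one entry per kernel size.
    Assumption: averaging kernels stand in for learned 1D convolutions.
    """
    branches = np.stack(
        [temporal_conv1d(x, np.full(k, 1.0 / k)) for k in kernel_sizes]
    )  # (S, T, C)
    scores = branches.mean(axis=(1, 2))  # one scalar score per scale
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()  # softmax over scales
    mixed = np.tensordot(weights, branches, axes=1)  # (T, C)
    return mixed, weights


rng = np.random.default_rng(0)
video_feats = rng.standard_normal((8, 4, 7, 7))  # T=8 frames, C=4 channels
excited = motion_excitation(video_feats)
frame_desc = excited.mean(axis=(2, 3))  # (T, C) spatially pooled descriptors
mixed, scale_weights = cross_scale_integration(frame_desc)
```

In this sketch a perfectly static clip yields a zero difference, so the gate is a constant 0.5 and every feature is scaled uniformly; frames and channels with larger changes receive a larger gate. In a trained model the gate and the per-scale weights would come from learned layers rather than from fixed pooling.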



Long-Short Temporal Modeling for Efficient Action Recognition

Efficient long-short temporal modeling is key for enhancing the performa...

TDN: Temporal Difference Networks for Efficient Action Recognition

Temporal modeling still remains challenging for action recognition in vi...

TEA: Temporal Excitation and Aggregation for Action Recognition

Temporal modeling is key for action recognition in videos. It normally c...

TEINet: Towards an Efficient Architecture for Video Recognition

Efficiency is an important issue in designing video architectures for ac...

TAM: Temporal Adaptive Module for Video Recognition

Temporal modeling is crucial for capturing spatiotemporal structure in v...

EAN: Event Adaptive Network for Enhanced Action Recognition

Efficiently modeling spatial-temporal information in videos is crucial f...

MGSampler: An Explainable Sampling Strategy for Video Action Recognition

Frame sampling is a fundamental problem in video action recognition due ...