Spatio-Temporal Self-Attention Network for Video Saliency Prediction

08/24/2021
by Ziqiang Wang, et al.

3D convolutional neural networks have achieved promising results on video tasks in computer vision, including video saliency prediction, the task explored in this paper. However, a 3D convolution encodes visual representations only within a fixed local spatio-temporal neighborhood determined by its kernel size, whereas human attention is always attracted by relational visual features at different times in a video. To overcome this limitation, we propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction, in which multiple Spatio-Temporal Self-Attention (STSA) modules are employed at different levels of a 3D convolutional backbone to directly capture long-range relations between spatio-temporal features of different time steps. In addition, we propose an Attentional Multi-Scale Fusion (AMSF) module to integrate multi-level features with the perception of context in semantic and spatio-temporal subspaces. Extensive experiments demonstrate the contributions of the key components of our method, and results on the DHF1K, Hollywood-2, UCF, and DIEM benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art models.
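
To make the contrast with a fixed convolution kernel concrete, the following is a minimal NumPy sketch of generic scaled dot-product self-attention applied over all spatio-temporal positions of a feature volume. It is not the authors' STSA module, whose exact design is not given in the abstract; the function name `spatiotemporal_self_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and the tensor sizes are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatiotemporal_self_attention(features, w_q, w_k, w_v):
    """Generic self-attention over a (T, H, W, C) feature volume.

    Illustrative only: every (time, row, column) position becomes one token,
    so each position can attend to positions at any other time step.
    """
    t, h, w, c = features.shape
    tokens = features.reshape(t * h * w, c)        # (N, C), N = T*H*W

    q = tokens @ w_q                               # queries (N, D)
    k = tokens @ w_k                               # keys    (N, D)
    v = tokens @ w_v                               # values  (N, D)

    scores = q @ k.T / np.sqrt(q.shape[-1])        # pairwise affinities (N, N)
    attn = softmax(scores, axis=-1)                # each row sums to 1
    out = attn @ v                                 # weighted sum of values

    return out.reshape(t, h, w, -1), attn


# Hypothetical sizes: 4 time steps, a 3x3 spatial grid, 8 channels.
rng = np.random.default_rng(0)
T, H, W, C, D = 4, 3, 3, 8, 8
features = rng.standard_normal((T, H, W, C))
w_q, w_k, w_v = (rng.standard_normal((C, D)) * 0.1 for _ in range(3))

out, attn = spatiotemporal_self_attention(features, w_q, w_k, w_v)
```

Here `attn` has one row per spatio-temporal position and one column for every other position, so a position in the first frame receives nonzero weight from positions in the last frame. A 3D convolution of fixed kernel size would reach such positions only by stacking layers. The quadratic cost in the number of tokens is the usual trade-off of this generic formulation.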


research
12/04/2021

STJLA: A Multi-Context Aware Spatio-Temporal Joint Linear Attention Network for Traffic Forecasting

Traffic prediction has gradually attracted the attention of researchers ...
research
12/08/2021

STAF: A Spatio-Temporal Attention Fusion Network for Few-shot Video Classification

We propose STAF, a Spatio-Temporal Attention Fusion network for few-shot...
research
12/18/2019

Self-Attention Network for Skeleton-based Human Action Recognition

Skeleton-based action recognition has recently attracted a lot of attent...
research
04/18/2020

Attention, please: A Spatio-temporal Transformer for 3D Human Motion Prediction

In this paper, we propose a novel architecture for the task of 3D human ...
research
01/09/2020

STAViS: Spatio-Temporal AudioVisual Saliency Network

We introduce STAViS, a spatio-temporal audiovisual saliency network that...
research
03/11/2023

SPOTR: Spatio-temporal Pose Transformers for Human Motion Prediction

3D human motion prediction is a research area of high significance and a...
research
11/02/2021

Relational Self-Attention: What's Missing in Attention for Video Understanding

Convolution has been arguably the most important feature transform for m...
