GTA: Global Temporal Attention for Video Action Understanding

12/15/2020
by Bo He, et al.

Self-attention learns pairwise interactions via dot products to model long-range dependencies, yielding substantial improvements in video action recognition. In this paper, we seek a deeper understanding of self-attention for temporal modeling in videos. In particular, we demonstrate that the entangled modeling of spatial-temporal information by flattening all pixels is sub-optimal, failing to capture temporal relationships among frames explicitly. We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner. Unlike conventional self-attention, which computes an instance-specific attention matrix, GTA randomly initializes a global attention matrix that is intended to learn stable temporal structures that generalize across different samples. GTA is further augmented with a cross-channel multi-head design to exploit feature interactions for better temporal modeling. We apply GTA not only to pixels but also to semantically similar regions identified automatically by a learned transformation matrix. Extensive experiments on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and achieves state-of-the-art performance on three video action recognition datasets.
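
To make the decoupled temporal attention concrete, the sketch below (PyTorch) illustrates the core idea described in the abstract: a randomly initialized T x T attention matrix is learned as a shared parameter and applied along the temporal axis, with channels split into heads so that each head owns its own global matrix. This is a minimal sketch under assumed conventions; the module name, tensor shapes, head count, output projection, and residual connection are illustrative choices, not the authors' exact design.

```python
import torch
import torch.nn as nn


class GlobalTemporalAttention(nn.Module):
    """Minimal sketch of decoupled global temporal attention.

    Unlike standard self-attention, the temporal attention matrix is a
    learned parameter shared across all samples rather than being computed
    from instance-specific queries and keys. Channels are split into heads
    so each head attends over time with its own global matrix
    (a cross-channel multi-head arrangement). Shapes and hyperparameters
    are illustrative assumptions.
    """

    def __init__(self, channels: int, num_frames: int, num_heads: int = 4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        # One global T x T attention matrix per head, randomly initialized
        # and learned end-to-end (not computed per instance).
        self.global_attn = nn.Parameter(
            torch.randn(num_heads, num_frames, num_frames) * 0.02
        )
        # Output projection (an assumption for this sketch).
        self.proj = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) -- batch, frames, spatial positions (or regions), channels
        B, T, N, C = x.shape
        h = self.num_heads
        # Split channels into heads: (B, h, T, N, C // h)
        x_h = x.reshape(B, T, N, h, C // h).permute(0, 3, 1, 2, 4)
        # Normalize the shared temporal weights per query frame.
        attn = self.global_attn.softmax(dim=-1)          # (h, T, T)
        # Mix frames with the global matrix: identical weights for every sample.
        out = torch.einsum('htu,bhuns->bhtns', attn, x_h)
        # Merge heads back and project, with a residual connection (assumed).
        out = out.permute(0, 2, 3, 1, 4).reshape(B, T, N, C)
        return x + self.proj(out)
```

A short usage example under the same assumed shape convention:

```python
gta = GlobalTemporalAttention(channels=256, num_frames=8, num_heads=4)
clip = torch.randn(2, 8, 49, 256)   # 2 clips, 8 frames, 7x7 = 49 spatial tokens, 256 channels
out = gta(clip)                     # same shape: (2, 8, 49, 256)
```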


Related research

05/27/2021 · SSAN: Separable Self-Attention Network for Video Representation Learning
Self-attention has been successfully applied to video representation lea...

04/01/2022 · Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition
We propose Multi-head Self/Cross-Attention (MSCA), which introduces a te...

07/19/2021 · Action Forecasting with Feature-wise Self-Attention
We present a new architecture for human action forecasting from videos. ...

12/02/2021 · Stacked Temporal Attention: Improving First-person Action Recognition by Emphasizing Discriminative Clips
First-person action recognition is a challenging task in video understan...

04/18/2023 · GlobalMind: Global Multi-head Interactive Self-attention Network for Hyperspectral Change Detection
High spectral resolution imagery of the Earth's surface enables users to...

08/04/2019 · Action Recognition in Untrimmed Videos with Composite Self-Attention Two-Stream Framework
With the rapid development of deep learning algorithms, action recogniti...

02/16/2021 · Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries
We present EgoACO, a deep neural architecture for video action recogniti...
