Relational Self-Attention: What's Missing in Attention for Video Understanding

11/02/2021
by Manjin Kim, et al.

Convolution has arguably been the most important feature transform for modern neural networks, driving the advance of deep learning. The recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitations of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, are nevertheless limited for video understanding, where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed relational self-attention (RSA), that leverages the rich structure of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on standard motion-centric benchmarks for video action recognition, such as Something-Something-V1 & V2, Diving48, and FineGym.
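To make the idea of "dynamically generating relational kernels and aggregating relational contexts" concrete, here is a minimal, simplified sketch of an RSA-style transform on a single query position and its local spatio-temporal neighborhood. The function name, the plain additive combination of the basic and relational kernels, and the projection matrices `Wk` and `Wv` are illustrative assumptions for this sketch; the actual RSA block uses learned parameters and richer kernel/context constructions than shown here.

```python
import numpy as np

def relational_self_attention(query, context, Wk, Wv):
    """Simplified RSA-style transform (illustrative sketch, not the paper's exact block).

    query:   (d,)   feature at the target position
    context: (n, d) features in its local spatio-temporal neighborhood
    Wk:      (d, n) projection generating a basic dynamic kernel from the query
    Wv:      (d, d) value projection for the context
    """
    # Basic kernel: generated from the query content alone (content-adaptive,
    # analogous to a dynamic convolution kernel).
    basic_kernel = query @ Wk                            # (n,)

    # Relational kernel: generated from query-context correlations, which encode
    # correspondence (motion-like) structure across the neighborhood.
    relation = context @ query                           # (n,)
    relational_kernel = relation / (np.linalg.norm(relation) + 1e-6)

    # Combine both kernels and aggregate the projected context values.
    kernel = basic_kernel + relational_kernel            # (n,)
    values = context @ Wv                                # (n, d)
    return kernel @ values                               # (d,)
```

The key contrast with ordinary self-attention is that the aggregation weights here depend not only on query-key similarity but also on a kernel dynamically generated from the query itself, so the transform can respond to relational (motion) structure rather than appearance similarity alone.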

Related research

02/14/2021 · Learning Self-Similarity in Space and Time as Generalized Motion for Action Recognition
Spatio-temporal convolution often fails to learn motion dynamics in vide...

02/09/2021 · Is Space-Time Attention All You Need for Video Understanding?
We present a convolution-free approach to video classification built exc...

04/12/2020 · Relational Learning between Multiple Pulmonary Nodules via Deep Set Attention Transformers
Diagnosis and treatment of multiple pulmonary nodules are clinically imp...

08/24/2021 · Spatio-Temporal Self-Attention Network for Video Saliency Prediction
3D convolutional neural networks have achieved promising results for vid...

09/29/2020 · Knowledge Fusion Transformers for Video Action Recognition
We introduce Knowledge Fusion Transformers for video action classificati...

07/19/2021 · Action Forecasting with Feature-wise Self-Attention
We present a new architecture for human action forecasting from videos. ...

11/17/2020 · Exploring Self-Attention for Visual Odometry
Visual odometry networks commonly use pretrained optical flow networks i...
