Learning Self-Similarity in Space and Time as Generalized Motion for Action Recognition

02/14/2021
by Heeseung Kwon, et al.

Spatio-temporal convolution often fails to learn motion dynamics in videos, and thus an effective motion representation is required for video understanding in the wild. In this paper, we propose a rich and robust motion representation based on spatio-temporal self-similarity (STSS). Given a sequence of frames, STSS represents each local region as similarities to its neighbors in space and time. By converting appearance features into relational values, it enables the learner to better recognize structural patterns in space and time. We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it. The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision. With a sufficient volume of the neighborhood in space and time, it effectively captures long-term interaction and fast motion in the video, leading to robust action recognition. Our experimental analysis demonstrates its superiority over previous methods for motion modeling as well as its complementarity to spatio-temporal features from direct convolution. On the standard action recognition benchmarks Something-Something-V1 and V2, Diving-48, and FineGym, the proposed method achieves state-of-the-art results.
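
To make the idea of "similarities to its neighbors in space and time" concrete, here is a minimal NumPy sketch of how such a self-similarity volume could be computed. It is an illustration under stated assumptions, not the authors' implementation: the function name `stss`, the use of cosine similarity, the zero padding at borders, and the default neighborhood sizes are all choices made here for the example, and the learned part of the SELFY block (which extracts motion features from this volume) is not shown.

```python
import numpy as np

def stss(features, time_offsets=(-1, 0, 1), radius=1, eps=1e-8):
    """Illustrative spatio-temporal self-similarity volume (hypothetical helper).

    features: array of shape (T, H, W, C) holding per-frame feature maps.
    Returns an array of shape (T, H, W, L, U, V) where L = len(time_offsets)
    and U = V = 2 * radius + 1. Entry [t, y, x, l, u, v] is the cosine
    similarity between the feature at (t, y, x) and the feature at
    (t + time_offsets[l], y + u - radius, x + v - radius).
    Neighbors outside the video are treated as zero similarity (assumption).
    """
    T, H, W, C = features.shape
    # L2-normalise the channel vectors so a dot product is a cosine similarity.
    norm = np.linalg.norm(features, axis=-1, keepdims=True)
    f = features / np.maximum(norm, eps)

    side = 2 * radius + 1
    out = np.zeros((T, H, W, len(time_offsets), side, side), dtype=f.dtype)
    # Zero-pad the two spatial axes so every position has a full neighborhood.
    padded = np.pad(f, ((0, 0), (radius, radius), (radius, radius), (0, 0)))

    for li, dt in enumerate(time_offsets):
        for t in range(T):
            s = t + dt
            if s < 0 or s >= T:
                continue  # temporal neighbor out of range: leave zeros
            for u in range(side):
                for v in range(side):
                    # Neighbor frame shifted by (u - radius, v - radius).
                    shifted = padded[s, u:u + H, v:v + W, :]
                    out[t, :, :, li, u, v] = np.sum(f[t] * shifted, axis=-1)
    return out

# Usage: 4 frames of 5x6 feature maps with 8 channels.
video_features = np.random.default_rng(0).standard_normal((4, 5, 6, 8))
volume = stss(video_features)
print(volume.shape)  # (4, 5, 6, 3, 3, 3)
```

Each position's channel vector is replaced by a small grid of relational values, which is the sense in which appearance features are converted into similarities. Widening `time_offsets` or `radius` enlarges the neighborhood, which the abstract associates with capturing long-term interaction and fast motion.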


