Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition

08/22/2018
by   Unaiza Ahsan, et al.
0

We propose a self-supervised learning method to jointly reason about spatial and temporal context for video recognition. Recent self-supervised approaches have used spatial context [9, 34] as well as temporal coherency [32] but a combination of the two requires extensive preprocessing such as tracking objects through millions of video frames [59] or computing optical flow to determine frame regions with high motion [30]. We propose to combine spatial and temporal context in one self-supervised framework without any heavy preprocessing. We divide multiple video frames into grids of patches and train a network to solve jigsaw puzzles on these patches from multiple frames. So the network is trained to correctly identify the position of a patch within a video frame as well as the position of a patch over time. We also propose a novel permutation strategy that outperforms random permutations while significantly reducing computational and memory constraints. We use our trained network for transfer learning tasks such as video activity recognition and demonstrate the strength of our approach on two benchmark video action recognition datasets without using a single frame from these datasets for unsupervised pretraining of our proposed video jigsaw network.

READ FULL TEXT

page 2

page 4

page 5

page 8

page 9

research
11/23/2020

Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning

We present a novel way for self-supervised video representation learning...
research
05/06/2021

Unsupervised Visual Representation Learning by Tracking Patches in Video

Inspired by the fact that human eyes continue to develop tracking abilit...
research
11/14/2021

Unsupervised Action Localization Crop in Video Retargeting for 3D ConvNets

Untrimmed videos on social media or those captured by robots and surveil...
research
09/30/2022

Alignment-guided Temporal Attention for Video Action Recognition

Temporal modeling is crucial for various video learning tasks. Most rece...
research
11/19/2022

Efficient Video Representation Learning via Masked Video Modeling with Motion-centric Token Selection

Self-supervised Video Representation Learning (VRL) aims to learn transf...
research
08/08/2023

Temporal DINO: A Self-supervised Video Strategy to Enhance Action Prediction

The emerging field of action prediction plays a vital role in various co...
research
09/27/2022

AdaFocusV3: On Unified Spatial-temporal Dynamic Video Recognition

Recent research has revealed that reducing the temporal and spatial redu...

Please sign up or login with your details

Forgot password? Click here to reset