Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics

08/31/2020
by Jiangliu Wang, et al.

This paper proposes a novel pretext task for self-supervised video representation learning. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, and the spatial location and dominant color of the largest color diversity along the temporal axis. A neural network is then built and trained to predict these statistical summaries from the video frames. To alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact Cartesian coordinates. Our approach is inspired by the observation that the human visual system is sensitive to rapidly changing content in the visual field and needs only an impression of rough spatial locations to understand the visual content. To validate the effectiveness of the proposed approach, we conduct extensive experiments with three 3D backbone networks: C3D, 3D-ResNet, and R(2+1)D. The results show that our approach outperforms existing approaches across the three backbone networks on various downstream video analysis tasks, including action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts.
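As a rough illustration of the kind of label described in the abstract, the sketch below derives a coarse location and a quantized direction for the largest motion in a clip. It is not the authors' implementation: it assumes a dense motion field (for example, optical flow) has already been computed, it uses a single uniform grid as a stand-in for the spatial partitioning patterns, and the function name `motion_labels` and the parameters `grid` and `n_bins` are hypothetical.

```python
import numpy as np

def motion_labels(flow, grid=4, n_bins=8):
    """Coarse motion labels for one clip (illustrative sketch only).

    flow: array of shape (T, H, W, 2) holding per-pixel (dx, dy) motion
          vectors; assumed to be precomputed, e.g. by an optical flow method.
    Returns (block, direction): the index of the grid cell containing the
    largest accumulated motion, and that motion's quantized direction bin.
    """
    # Accumulate motion over time so one label summarizes the whole clip.
    total = flow.sum(axis=0)                      # (H, W, 2)
    mag = np.hypot(total[..., 0], total[..., 1])  # (H, W) motion magnitude

    # Pixel with the largest accumulated motion.
    y, x = np.unravel_index(np.argmax(mag), mag.shape)
    h, w = mag.shape

    # Rough location: the grid cell index, not the exact coordinates.
    row = min(y * grid // h, grid - 1)
    col = min(x * grid // w, grid - 1)
    block = int(row * grid + col)

    # Dominant direction: quantize the angle of that vector into n_bins.
    angle = np.arctan2(total[y, x, 1], total[y, x, 0]) % (2 * np.pi)
    direction = int(angle // (2 * np.pi / n_bins)) % n_bins
    return block, direction
```

A network trained on this pretext task would take the raw frames as input and predict `block` and `direction` as classification targets, which is why a coarse cell index is used here in place of exact coordinates.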

research
04/07/2019

Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

We address the problem of video representation learning without human-an...
research
08/13/2020

Self-supervised Video Representation Learning by Pace Prediction

This paper addresses the problem of self-supervised video representation...
research
08/05/2020

Self-supervised Temporal Discriminative Learning for Video Representation Learning

Temporal cues in videos provide important information for recognizing ac...
research
11/24/2018

Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles

Self-supervised tasks such as colorization, inpainting and jigsaw puzzle...
research
09/26/2021

Self-Supervised Video Representation Learning by Video Incoherence Detection

This paper introduces a novel self-supervised method that leverages inco...
research
12/12/2022

Contextual Explainable Video Representation: Human Perception-based Understanding

Video understanding is a growing field and a subject of intense research...
research
11/03/2022

FactorMatte: Redefining Video Matting for Re-Composition Tasks

We propose "factor matting", an alternative formulation of the video mat...
