Static and Dynamic Concepts for Self-supervised Video Representation Learning

07/26/2022
by Rui Qian, et al.

In this paper, we propose a novel learning scheme for self-supervised video representation learning. Motivated by how humans understand videos, we propose to first learn general visual concepts and then attend to discriminative local areas for video understanding. Specifically, we use static frames and frame differences to help decouple static and dynamic concepts, and align the corresponding concept distributions in latent space. We add diversity and fidelity regularizations to ensure that we learn a compact set of meaningful concepts. We then employ a cross-attention mechanism to aggregate detailed local features of different concepts, and filter out redundant concepts with low activations to perform local concept contrast. Extensive experiments demonstrate that our method distills meaningful static and dynamic concepts to guide video understanding, and obtains state-of-the-art results on UCF-101, HMDB-51, and Diving-48.
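
The abstract gives no implementation details, so the following is only a minimal sketch of two ideas it mentions: taking frame differences as a motion cue, and using cross-attention to aggregate local features per concept before dropping concepts with low activations. It is not the authors' code. The function names, the use of the maximum attention score as the "activation", and the `keep` parameter are all assumptions made for illustration.

```python
import math

def frame_differences(frames):
    """Differences between consecutive frames (each frame is a flat list of pixel values)."""
    return [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(frames, frames[1:])]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def concept_cross_attention(concepts, local_feats, keep):
    """Aggregate local features per concept, then keep the `keep` most active concepts.

    concepts:    K query vectors of dimension D (hypothetical concept embeddings)
    local_feats: N local feature vectors of dimension D
    Returns (indices of kept concepts, aggregated feature per kept concept).
    """
    dim = len(concepts[0])
    results = []
    for k, query in enumerate(concepts):
        # Scaled dot-product scores between this concept and every local feature.
        scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
                  for feat in local_feats]
        weights = softmax(scores)
        # Attention-weighted sum of the local features.
        agg = [sum(w * feat[d] for w, feat in zip(weights, local_feats))
               for d in range(dim)]
        # Assumption: the highest raw score stands in for the concept's activation.
        results.append((max(scores), k, agg))
    # Keep the `keep` concepts with the highest activation, in original order.
    results.sort(key=lambda r: r[0], reverse=True)
    kept = sorted(results[:keep], key=lambda r: r[1])
    return [k for _, k, _ in kept], [agg for _, _, agg in kept]

if __name__ == "__main__":
    frames = [[0, 0], [1, 2], [1, 5]]
    print(frame_differences(frames))
    concepts = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
    local_feats = [[2.0, 0.0], [0.0, 1.0]]
    print(concept_cross_attention(concepts, local_feats, keep=2))
```

In a real model the concepts and local features would come from learned encoders, and the kept concept features would feed a contrastive loss. Here they are small hand-written vectors so the filtering step is easy to follow.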

research
01/16/2021

Self-Supervised Representation Learning from Flow Equivariance

Self-supervised representation learning is able to learn semantically me...
research
04/07/2019

Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

We address the problem of video representation learning without human-an...
research
12/07/2021

Suppressing Static Visual Cues via Normalizing Flows for Self-Supervised Video Representation Learning

Despite the great progress in video understanding made by deep convoluti...
research
09/26/2021

Self-Supervised Video Representation Learning by Video Incoherence Detection

This paper introduces a novel self-supervised method that leverages inco...
research
07/12/2022

Dual Contrastive Learning for Spatio-temporal Representation

Contrastive learning has shown promising potential in self-supervised sp...
research
03/08/2019

Feel the Static and Kinetic Friction

Multimodal simulations augment the presentation of abstract concepts fac...
research
05/31/2022

D^2NeRF: Self-Supervised Decoupling of Dynamic and Static Objects from a Monocular Video

Given a monocular video, segmenting and decoupling dynamic objects while...
