STC-mix: Space, Time, Channel mixing for Self-supervised Video Representation

12/07/2021
by   Srijan Das, et al.

Contrastive representation learning of videos relies heavily on the availability of millions of unlabelled videos. This is practical for videos available on the web, but acquiring videos at such a scale for real-world applications is very expensive and laborious. Therefore, in this paper we focus on designing video augmentation for self-supervised learning. We first analyze the best strategy to mix videos to create a new augmented video sample. The question then remains: can we make use of the other modalities in videos for data mixing? To this end, we propose Cross-Modal Manifold Cutmix (CMMC), which inserts a video tesseract into another video tesseract in the feature space across two different modalities. We find that our video mixing strategy STC-mix, i.e. preliminary mixing of videos followed by CMMC across different modalities in a video, improves the quality of learned video representations. We conduct thorough experiments on two downstream tasks, action recognition and video retrieval, on two small-scale video datasets, UCF101 and HMDB51. We also demonstrate the effectiveness of STC-mix on the NTU dataset, where domain knowledge is limited. We show that the performance of STC-mix on both downstream tasks is on par with other self-supervised approaches while requiring less training data.
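The abstract gives no implementation details for CMMC, so the following is only a minimal NumPy sketch of the general idea of pasting a space-time box (a "tesseract") from one modality's feature map into another's. The function names, the (C, T, H, W) feature layout, the choice of RGB and optical flow as the two modalities, and the way the box is sampled from a mixing ratio `lam` are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def sample_tesseract(shape, lam, rng):
    """Sample a space-time box covering roughly (1 - lam) of a (T, H, W) volume.

    Hypothetical helper: the box-sampling rule is an assumption.
    """
    dims = tuple(shape)
    ratio = (1.0 - lam) ** (1.0 / 3.0)  # same shrink factor along each axis
    sizes = [max(1, int(round(n * ratio))) for n in dims]
    starts = [int(rng.integers(0, n - s + 1)) for n, s in zip(dims, sizes)]
    return tuple(slice(st, st + s) for st, s in zip(starts, sizes))

def cross_modal_cutmix(feat_a, feat_b, lam, rng):
    """Paste a tesseract of feat_b into feat_a.

    Both inputs are assumed to be (C, T, H, W) feature maps taken from
    two modality-specific encoders at the same layer.
    """
    assert feat_a.shape == feat_b.shape
    box = sample_tesseract(feat_a.shape[1:], lam, rng)
    index = (slice(None),) + box  # all channels, boxed in T, H, W
    mixed = feat_a.copy()
    mixed[index] = feat_b[index]
    # Recompute the ratio from the box actually pasted, since sizes were rounded.
    box_volume = np.prod([s.stop - s.start for s in box])
    lam_eff = 1.0 - box_volume / np.prod(feat_a.shape[1:])
    return mixed, float(lam_eff)

# Toy usage: zeros stand in for RGB features, ones for optical-flow features.
rng = np.random.default_rng(0)
rgb_feat = np.zeros((8, 4, 7, 7))
flow_feat = np.ones((8, 4, 7, 7))
mixed, lam_eff = cross_modal_cutmix(rgb_feat, flow_feat, lam=0.7, rng=rng)
```

In this toy setup the mean of `mixed` equals `1 - lam_eff`, the fraction of the volume taken from the second modality. In a contrastive setup, a ratio like `lam_eff` would typically weight the targets of the two source samples, but how the paper uses it is not stated in the abstract.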


