Self-supervised Video Transformer

12/02/2021
by Kanchana Ranasinghe, et al.

In this paper, we propose self-supervised training for video transformers using unlabelled video data. From a given video, we create local and global spatiotemporal views with varying spatial sizes and frame rates. Our self-supervised objective matches the features of these different views of the same video, encouraging invariance to spatiotemporal variations in actions. To the best of our knowledge, the proposed Self-supervised Video Transformer (SVT) is the first to alleviate the dependency on negative samples or dedicated memory banks. Further, owing to the flexibility of Transformer models, SVT supports slow-fast video processing within a single architecture using dynamically adjusted positional encodings, and supports long-term relationship modeling along spatiotemporal dimensions. Our approach performs well on four action recognition benchmarks (Kinetics-400, UCF-101, HMDB-51, and SSv2) and converges faster with small batch sizes. Code: https://git.io/J1juJ
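
As a rough illustration of the idea in the abstract (not the authors' released code; see the linked repository for that), the sketch below shows one way the local/global view construction and a negative-free view-matching loss could be written in PyTorch. The function names (sample_views, view_matching_loss), crop sizes, frame counts, and temperatures are illustrative assumptions, and the matching loss uses a DINO-style teacher-student cross-entropy, a common choice for objectives that avoid negative samples or memory banks.

```python
# Minimal sketch of the cross-view matching objective described in the abstract.
# This is NOT the authors' implementation; the view sampler, shapes, and all
# hyperparameters below are illustrative assumptions.
import torch
import torch.nn.functional as F


def sample_views(video, global_frames=16, local_frames=8, local_size=96):
    """Create one global and one local spatiotemporal view from a clip.

    `video` has shape (B, C, T, H, W) with H, W >= local_size. The global view
    keeps full spatial resolution with denser frame sampling; the local view
    uses fewer frames and a random smaller spatial crop (frame-rate handling
    and multi-crop sampling are simplified here).
    """
    B, C, T, H, W = video.shape
    # Global view: uniformly sample `global_frames` frames at full resolution.
    g_idx = torch.linspace(0, T - 1, global_frames).long()
    global_view = video[:, :, g_idx]
    # Local view: sparser frame sampling plus a random spatial crop.
    l_idx = torch.linspace(0, T - 1, local_frames).long()
    y = torch.randint(0, H - local_size + 1, (1,)).item()
    x = torch.randint(0, W - local_size + 1, (1,)).item()
    local_view = video[:, :, l_idx, y:y + local_size, x:x + local_size]
    return global_view, local_view


def view_matching_loss(student_feats, teacher_feats, temp_s=0.1, temp_t=0.04):
    """Match views without negatives: the student's projection of a local view
    is pulled toward the (detached) teacher projection of a global view of the
    same video via a cross-entropy between softened distributions."""
    teacher_probs = F.softmax(teacher_feats.detach() / temp_t, dim=-1)
    student_logp = F.log_softmax(student_feats / temp_s, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()
```

In a training loop, the global view would be fed to a momentum teacher and both views to the student, with view_matching_loss minimized between the resulting projections so that features of the same video agree across spatial sizes and frame rates.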

Related research

11/28/2018
Self-supervised Spatiotemporal Feature Learning by Video Geometric Transformations
To alleviate the expensive cost of data collection and annotation, many ...

11/10/2022
3D-CSL: self-supervised 3D context similarity learning for Near-Duplicate Video Retrieval
In this paper, we introduce 3D-CSL, a compact pipeline for Near-Duplicat...

07/27/2020
Representation Learning with Video Deep InfoMax
Self-supervised learning has made unsupervised pretraining relevant agai...

06/17/2021
Long-Short Temporal Contrastive Learning of Video Transformers
Video transformers have recently emerged as a competitive alternative to...

06/29/2023
Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train
Foundation models have exhibited remarkable success in various applicati...

11/17/2022
UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
Learning discriminative spatiotemporal representation is the key problem...

03/27/2023
Exemplar-based Video Colorization with Long-term Spatiotemporal Dependency
Exemplar-based video colorization is an essential technique for applicat...
