Long-Short Temporal Contrastive Learning of Video Transformers

06/17/2021
by Jue Wang, et al.

Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K. Since transformer-based models are effective at capturing dependencies over extended temporal spans, we propose a simple learning procedure that forces the model to match a long-term view to a short-term view of the same video. Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent. To demonstrate the generality of our findings, we implement and validate our approach under three different self-supervised contrastive learning frameworks (MoCo v3, BYOL, SimSiam) using two distinct video-transformer architectures, including an improved variant of the Swin Transformer augmented with space-time attention. We conduct a thorough ablation study and show that LSTCL achieves competitive performance on multiple video benchmarks and represents a convincing alternative to supervised image-based pretraining.

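The procedure lends itself to a compact implementation: sample a short clip and a longer clip from the same video, embed each with a video transformer, and pull the two embeddings together with a contrastive objective. The sketch below is a minimal illustration of the MoCo v3-style variant, assuming PyTorch, a batch-level InfoNCE loss, and hypothetical helper names (sample_long_short_clips, lstcl_step); the clip lengths, encoder interface, and momentum-encoder handling are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of a long-short temporal contrastive step.
# Assumptions: clip lengths, encoder interface, and an InfoNCE objective
# in the MoCo v3 style; the authors' actual code may differ.
import torch
import torch.nn.functional as F

def sample_long_short_clips(video, short_len=8, long_len=32):
    """Sample a short clip and a longer clip from the same video.

    video: (C, T, H, W) tensor of frames. The two clips cover different
    temporal extents, so matching them forces the model to infer context
    visible only in the long-term view.
    """
    C, T, H, W = video.shape
    t_long = torch.randint(0, T - long_len + 1, (1,)).item()
    long_clip = video[:, t_long:t_long + long_len]
    t_short = torch.randint(0, T - short_len + 1, (1,)).item()
    short_clip = video[:, t_short:t_short + short_len]
    return short_clip, long_clip

def info_nce(q, k, temperature=0.2):
    """InfoNCE over a batch: positives are the matching (q_i, k_i) pairs,
    all other pairs in the batch serve as negatives."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def lstcl_step(encoder, momentum_encoder, videos):
    """One training step: short view through the online encoder,
    long view through the momentum (target) encoder."""
    shorts, longs = zip(*[sample_long_short_clips(v) for v in videos])
    shorts, longs = torch.stack(shorts), torch.stack(longs)
    q = encoder(shorts)                          # (B, D) clip embeddings
    with torch.no_grad():
        k = momentum_encoder(longs)              # (B, D) target embeddings
    return info_nce(q, k)
```

Routing the long-term view through the momentum encoder means the online encoder must predict temporal context it never directly observes, which is the intended effect of the long-short asymmetry described above.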