Long-Short Temporal Contrastive Learning of Video Transformers

06/17/2021
by   Jue Wang, et al.
0

Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K. Since transformer-based models are effective at capturing dependencies over extended temporal spans, we propose a simple learning procedure that forces the model to match a long-term view to a short-term view of the same video. Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent. To demonstrate the generality of our findings, we implement and validate our approach under three different self-supervised contrastive learning frameworks (MoCo v3, BYOL, SimSiam) using two distinct video-transformer architectures, including an improved variant of the Swin Transformer augmented with space-time attention. We conduct a thorough ablation study and show that LSTCL achieves competitive performance on multiple video benchmarks and represents a convincing alternative to supervised image-based pretraining.

READ FULL TEXT
07/19/2022

Time Is MattEr: Temporal Self-supervision for Video Transformers

Understanding temporal dynamics of video is an essential aspect of learn...
06/03/2021

When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations

Vision Transformers (ViTs) and MLPs signal further efforts on replacing ...
10/26/2021

TUNet: A Block-online Bandwidth Extension Model based on Transformers and Self-supervised Pretraining

We introduce a block-online variant of the temporal feature-wise linear ...
12/02/2021

Self-supervised Video Transformer

In this paper, we propose self-supervised training for video transformer...
07/15/2022

Position Prediction as an Effective Pretraining Strategy

Transformers have gained increasing popularity in a wide range of applic...
01/16/2022

Video Transformers: A Survey

Transformer models have shown great success modeling long-range interact...
07/22/2022

Video Swin Transformers for Egocentric Video Understanding @ Ego4D Challenges 2022

We implemented Video Swin Transformer as a base architecture for the tas...