Time Is MattEr: Temporal Self-supervision for Video Transformers

07/19/2022
by   Sukmin Yun, et al.
26

Understanding temporal dynamics of video is an essential aspect of learning better video representations. Recently, transformer-based architectural designs have been extensively explored for video tasks due to their capability to capture long-term dependency of input sequences. However, we found that these Video Transformers are still biased to learn spatial dynamics rather than temporal ones, and debiasing the spurious correlation is critical for their performance. Based on the observations, we design simple yet effective self-supervised tasks for video models to learn temporal dynamics better. Specifically, for debiasing the spatial bias, our method learns the temporal order of video frames as extra self-supervision and enforces the randomly shuffled frames to have low-confidence outputs. Also, our method learns the temporal flow direction of video tokens among consecutive frames for enhancing the correlation toward temporal dynamics. Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.

READ FULL TEXT

page 6

page 13

research
06/17/2021

Long-Short Temporal Contrastive Learning of Video Transformers

Video transformers have recently emerged as a competitive alternative to...
research
08/25/2023

Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers

Vision Transformers achieve impressive accuracy across a range of visual...
research
06/12/2023

Unmasking Deepfakes: Masked Autoencoding Spatiotemporal Transformers for Enhanced Video Forgery Detection

We present a novel approach for the detection of deepfake videos using a...
research
03/17/2023

Dual-path Adaptation from Image to Video Transformers

In this paper, we efficiently transfer the surpassing representation pow...
research
04/18/2023

SViTT: Temporal Learning of Sparse Video-Text Transformers

Do video-text transformers learn to model temporal relationships across ...
research
01/16/2022

Video Transformers: A Survey

Transformer models have shown great success modeling long-range interact...
research
11/17/2014

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

Models based on deep convolutional networks have dominated recent image ...

Please sign up or login with your details

Forgot password? Click here to reset