
Can Temporal Information Help with Contrastive Self-Supervised Learning?

by Yutong Bai, et al.

Leveraging temporal information has been regarded as essential for developing video understanding models. However, it remains unclear how to properly incorporate temporal information into the recently successful instance-discrimination-based contrastive self-supervised learning (CSL) framework. We find that the intuitive solution, directly applying temporal augmentations, does not help and can even impair video CSL in general. This counter-intuitive observation motivates us to re-design existing video CSL frameworks for better integration of temporal knowledge. To this end, we present Temporal-aware Contrastive self-supervised learning (TaCo), a general paradigm to enhance video CSL. Specifically, TaCo selects a set of temporal transformations that serve not only as strong data augmentation but also as extra self-supervision for video understanding. By jointly contrasting instances with enriched temporal transformations and learning these transformations as self-supervised signals, TaCo can significantly enhance unsupervised video representation learning. For instance, TaCo demonstrates consistent improvement in downstream classification tasks over a list of backbones and CSL approaches. Our best model achieves 85.1 (UCF-101) and 51.6, improving over the previous state-of-the-art.
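
The joint objective described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The abstract does not list the temporal transformations, so the set used here (identity, reverse, shuffle, speed-up), the function names, the temperature and the loss weight are all assumptions made for illustration. The sketch applies a labeled temporal transformation to a clip, then sums an InfoNCE-style contrastive term with a cross-entropy term for recognizing which transformation was applied.

```python
import math
import random

# Hypothetical transformation set; the abstract does not say which ones TaCo uses.
TRANSFORMS = ("identity", "reverse", "shuffle", "speedup")


def apply_temporal_transform(frames, label, rng):
    """Return a temporally transformed copy of `frames` for TRANSFORMS[label]."""
    name = TRANSFORMS[label]
    if name == "identity":
        return list(frames)
    if name == "reverse":
        return list(frames)[::-1]
    if name == "shuffle":
        out = list(frames)
        rng.shuffle(out)
        return out
    # "speedup": keep every second frame (assumed 2x playback rate)
    return list(frames)[::2]


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def info_nce(sim_pos, sim_negs, temperature=0.1):
    """Contrastive term for one query: the positive pair against the negatives."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    return -log_softmax(logits)[0]


def transform_cross_entropy(logits, label):
    """Self-supervised term: classify which temporal transformation was applied."""
    return -log_softmax(logits)[label]


def joint_loss(sim_pos, sim_negs, transform_logits, label, weight=1.0):
    """Sum of the contrastive term and the weighted transformation-recognition term."""
    contrastive = info_nce(sim_pos, sim_negs)
    recognition = transform_cross_entropy(transform_logits, label)
    return contrastive + weight * recognition
```

In a real pipeline the similarities and transformation logits would come from a video encoder with a projection head and a classification head. Here they are plain floats so that the arithmetic of the combined loss stays visible.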




Time-Equivariant Contrastive Video Representation Learning

We introduce a novel self-supervised contrastive learning method to lear...

Composable Augmentation Encoding for Video Representation Learning

We focus on contrastive methods for self-supervised video representation...

Using Navigational Information to Learn Visual Representations

Children learn to build a visual representation of the world from unsupe...

iBoot: Image-bootstrapped Self-Supervised Video Representation Learning

Learning visual representations through self-supervision is an extremely...

The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning

Contrastive learning of auditory and visual perception has been extremel...

TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition

Recognizing transformation types applied to a video clip (RecogTrans) is...

Self-supervised classification of dynamic obstacles using the temporal information provided by videos

Nowadays, autonomous driving systems can detect, segment, and classify t...