Self-supervised Temporal Discriminative Learning for Video Representation Learning

08/05/2020
by Jinpeng Wang, et al.

Temporal cues in videos provide important information for recognizing actions accurately. However, temporal-discriminative features can hardly be extracted without training on a large-scale annotated video action dataset. This paper proposes a novel Video-based Temporal-Discriminative Learning (VTDL) framework that operates in a self-supervised manner. Without labelled data for network pretraining, a temporal triplet is generated for each anchor video by using segments from the same or a different time interval, so as to enhance the capacity for temporal feature representation. Measuring temporal information by the time derivative, Temporal Consistent Augmentation (TCA) is designed to ensure that the time derivative (of any order) of the augmented positive is invariant except for a scaling constant. Finally, temporal-discriminative features are learnt by minimizing the distance between each anchor and its augmented positive, while maximizing the distance between each anchor and its augmented negative, as well as other videos saved in the memory bank, to enrich the representation diversity. In the downstream action recognition task, the proposed method significantly outperforms existing related works. Surprisingly, the proposed self-supervised approach is better than fully-supervised methods on UCF101 and HMDB51 when a small-scale video dataset (with only thousands of videos) is used for pre-training. The code is publicly available at https://github.com/FingerRec/Self-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition.
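The two ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation, and every name in it (`time_derivative`, `augment`, `contrastive_loss`, `temperature`) is hypothetical. It rests on two assumptions: the augmentation is a single affine intensity change applied identically to every frame, which is one simple transform whose finite-difference time derivative changes only by a scaling constant; and a softmax-style contrastive loss over cosine similarity stands in for the objective, whose exact form the abstract does not give.

```python
import math


def time_derivative(clip, order=1):
    """Finite-difference time derivative of a clip (list of frames, each a list of floats)."""
    for _ in range(order):
        clip = [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(clip, clip[1:])]
    return clip


def augment(clip, scale, offset):
    """Assumed augmentation: the same affine intensity change on every frame."""
    return [[scale * v + offset for v in frame] for frame in clip]


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Softmax-style stand-in loss: small when the anchor is close to the
    positive and far from the negatives (augmented negative plus memory bank)."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# A toy clip with four frames of two "pixels" each.
clip = [[0.0, 1.0], [1.0, 3.0], [4.0, 2.0], [9.0, 0.0]]
augmented = augment(clip, scale=2.0, offset=5.0)

# The derivative of the augmented clip is the original derivative times `scale`.
for order in (1, 2):
    original = time_derivative(clip, order)
    scaled = [[2.0 * v for v in frame] for frame in original]
    print(order, time_derivative(augmented, order) == scaled)

# Toy features: the loss is lower when the positive is close to the anchor.
anchor = [1.0, 0.0]
memory_bank = [[0.0, 1.0], [-1.0, 0.0]]
print(contrastive_loss(anchor, [0.9, 0.1], memory_bank))
print(contrastive_loss(anchor, [0.0, 1.0], memory_bank))
```

The constant offset vanishes under differencing and the scale factors out, so the property holds at every derivative order for this particular transform. The augmentations actually used in TCA may differ.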


research
11/21/2016

Self-Supervised Video Representation Learning With Odd-One-Out Networks

We propose a new self-supervised CNN pre-training technique based on a n...
research
07/08/2021

Video 3D Sampling for Self-supervised Representation Learning

Most of the existing video self-supervised methods mainly leverage tempo...
research
09/27/2022

Video-based estimation of pain indicators in dogs

Dog owners are typically capable of recognizing behavioral cues that rev...
research
08/31/2020

Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics

This paper proposes a novel pretext task to address the self-supervised ...
research
09/23/2022

Leveraging Self-Supervised Training for Unintentional Action Recognition

Unintentional actions are rare occurrences that are difficult to define ...
research
05/06/2023

Transform-Equivariant Consistency Learning for Temporal Sentence Grounding

This paper addresses the temporal sentence grounding (TSG). Although exi...
research
08/19/2022

Self-Supervised Visual Place Recognition by Mining Temporal and Feature Neighborhoods

Visual place recognition (VPR) using deep networks has achieved state-of...
