VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples

by   Tian Pan, et al.

MoCo is effective for unsupervised image representation learning. In this paper, we propose VideoMoCo for unsupervised video representation learning. Given a video sequence as an input sample, we improve the temporal feature representations of MoCo from two perspectives. First, we introduce a generator to drop out several frames from this sample temporally. The discriminator is then learned to encode similar feature representations regardless of frame removals. By adaptively dropping out different frames during training iterations of adversarial learning, we augment this input sample to train a temporally robust encoder. Second, we use temporal decay to model key attenuation in the memory queue when computing the contrastive loss. As the momentum encoder updates after keys enqueue, the representation ability of these keys degrades when we use the current input sample for contrastive learning. This degradation is reflected via temporal decay to attend the input sample to recent keys in the queue. As a result, we adapt MoCo to learn video representations without empirically designing pretext tasks. By empowering the temporal robustness of the encoder and modeling the temporal decay of the keys, our VideoMoCo improves MoCo temporally based on contrastive learning. Experiments on benchmark datasets including UCF101 and HMDB51 show that VideoMoCo stands as a state-of-the-art video representation learning method.



page 4

page 5

page 6


Motion-Focused Contrastive Learning of Video Representations

Motion, as the most distinct phenomenon in a video to involve the change...

Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition

Action recognition via 3D skeleton data is an emerging important topic i...

PreViTS: Contrastive Pretraining with Video Tracking Supervision

Videos are a rich source for self-supervised learning (SSL) of visual re...

Probabilistic Representations for Video Contrastive Learning

This paper presents Probabilistic Video Contrastive Learning, a self-sup...

Unsupervised Learning of Spatiotemporally Coherent Metrics

Current state-of-the-art classification and detection algorithms rely on...

Unsupervised Feature Learning from Temporal Data

Current state-of-the-art classification and detection algorithms rely on...

Unsupervised Learning of Disentangled Representations from Video

We present a new model DrNET that learns disentangled image representati...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.