Hierarchical Contrastive Motion Learning for Video Action Recognition

by   Xitong Yang, et al.

One central question for video action recognition is how to model motion. In this paper, we present hierarchical contrastive motion learning, a new self-supervised learning framework to extract effective motion representations from raw video frames. Our approach progressively learns a hierarchy of motion features that correspond to different abstraction levels in a network. This hierarchical design bridges the semantic gap between low-level motion cues and high-level recognition tasks, and promotes the fusion of appearance and motion information at multiple levels. At each level, an explicit motion self-supervision is provided via contrastive learning to enforce the motion features at the current level to predict the future ones at the previous level. Thus, the motion features at higher levels are trained to gradually capture semantic dynamics and evolve more discriminative for action recognition. Our motion learning module is lightweight and flexible to be embedded into various backbone networks. Extensive experiments on four benchmarks show that the proposed approach consistently achieves superior results.


MS^2L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition

In this paper, we address self-supervised representation learning from h...

Featureless: Bypassing feature extraction in action categorization

This method introduces an efficient manner of learning action categories...

Video-based Contrastive Learning on Decision Trees: from Action Recognition to Autism Diagnosis

How can we teach a computer to recognize 10,000 different actions? Deep ...

MoQuad: Motion-focused Quadruple Construction for Video Contrastive Learning

Learning effective motion features is an essential pursuit of video repr...

Variable-rate hierarchical CPC leads to acoustic unit discovery in speech

The success of deep learning comes from its ability to capture the hiera...

MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition

Current state-of-the-art approaches for few-shot action recognition achi...

Hierarchical Explanations for Video Action Recognition

We propose Hierarchical ProtoPNet: an interpretable network that explain...

Please sign up or login with your details

Forgot password? Click here to reset