Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning

08/09/2023
by Qiang Wang, et al.

Contrastive Language-Image Pre-training (CLIP) has recently shown remarkable "zero-shot" generalization and has been applied to many downstream tasks. We explore adapting CLIP to obtain a more efficient and generalized action recognition method, and we argue that the key lies in explicitly modeling the motion cues flowing through video frames. To that end, we design a two-stream motion modeling block that captures motion and spatial information simultaneously. The obtained motion cues are then used to drive a dynamic prompt learner that generates motion-aware prompts, which carry rich semantic information about human actions. In addition, we propose a multimodal communication block to enable collaborative learning and further improve performance. We conduct extensive experiments on the HMDB-51, UCF-101, and Kinetics-400 datasets. Our method outperforms most existing state-of-the-art methods by a significant margin under "few-shot" and "zero-shot" settings, and achieves competitive performance under "closed-set" training with very few trainable parameters and little additional computational cost.
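To make the idea concrete, the sketch below (PyTorch) shows one plausible way such a pipeline could be wired together: a two-stream block that fuses per-frame CLIP features with their temporal differences as a proxy for motion, and a conditional prompt learner that shifts a set of learnable prompt tokens by a pooled motion cue. This is not the authors' implementation; all module names, dimensions, and the use of frame differences as the motion stream are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): motion-aware prompts from
# per-frame CLIP features. Frame differences stand in for the motion stream;
# the prompt learner conditions learnable tokens on a pooled motion cue.
import torch
import torch.nn as nn


class TwoStreamMotionBlock(nn.Module):
    """Fuse spatial (per-frame) and motion (frame-difference) information."""

    def __init__(self, dim: int):
        super().__init__()
        self.spatial = nn.Linear(dim, dim)   # per-frame spatial stream
        self.motion = nn.Linear(dim, dim)    # frame-difference motion stream
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) per-frame CLIP features
        diff = frame_feats[:, 1:] - frame_feats[:, :-1]           # (B, T-1, D)
        diff = torch.cat([diff, diff[:, -1:]], dim=1)             # pad back to T
        fused = self.spatial(frame_feats) + self.motion(diff)
        return self.norm(fused)                                   # (B, T, D)


class MotionPromptLearner(nn.Module):
    """Generate motion-aware prompt tokens from pooled motion cues."""

    def __init__(self, dim: int, num_prompts: int = 8):
        super().__init__()
        self.base_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.cond = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, motion_feats: torch.Tensor) -> torch.Tensor:
        # motion_feats: (B, T, D) output of the two-stream block
        ctx = self.cond(motion_feats.mean(dim=1))                 # (B, D) video-level cue
        # Each video gets its own prompts: shared base tokens shifted by its cue.
        return self.base_prompts.unsqueeze(0) + ctx.unsqueeze(1)  # (B, P, D)


if __name__ == "__main__":
    B, T, D = 2, 8, 512                      # e.g., 8 frames of 512-d CLIP features
    frame_feats = torch.randn(B, T, D)
    motion_feats = TwoStreamMotionBlock(D)(frame_feats)
    prompts = MotionPromptLearner(D)(motion_feats)
    print(prompts.shape)                     # torch.Size([2, 8, 512])
```

In an actual CLIP adaptation, these motion-aware prompt tokens would be prepended to the text (or visual) encoder input so that only the small prompt and fusion modules are trained, which is consistent with the paper's claim of very few trainable parameters.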
