ActionCLIP: A New Paradigm for Video Action Recognition

09/17/2021
by   Mengmeng Wang, et al.
0

The canonical approach to video action recognition dictates a neural model to do a classic and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined categories, limiting their transferable ability on new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameters requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune". This paradigm first learns powerful representations from pre-training on a large amount of web image-text or video-text data. Then it makes the action recognition task to act more like pre-training problems via prompt engineering. Finally, it end-to-end fine-tunes on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches a top performance on general action recognition task, achieving 83.8 top-1 accuracy on Kinetics-400 with a ViT-B/16 as the backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/27/2021

A New Split for Evaluating True Zero-Shot Action Recognition

Zero-shot action recognition is the task of classifying action categorie...
research
08/03/2023

Multimodal Adaptation of CLIP for Few-Shot Action Recognition

Applying large-scale pre-trained visual models like CLIP to few-shot act...
research
09/29/2022

REST: REtrieve Self-Train for generative action recognition

This work is on training a generative action/video recognition model who...
research
12/03/2022

VLG: General Video Recognition with Web Textual Knowledge

Video recognition in an open and dynamic world is quite challenging, as ...
research
12/18/2021

Tell me what you see: A zero-shot action recognition method based on natural language descriptions

Recently, several approaches have explored the detection and classificat...
research
03/06/2023

CLIP-guided Prototype Modulating for Few-shot Action Recognition

Learning from large-scale contrastive language-image pre-training like C...
research
01/11/2021

Learning from Weakly-labeled Web Videos via Exploring Sub-Concepts

Learning visual knowledge from massive weakly-labeled web videos has att...

Please sign up or login with your details

Forgot password? Click here to reset