VLG: General Video Recognition with Web Textual Knowledge

12/03/2022
by   Jintao Lin, et al.

Video recognition in an open and dynamic world is quite challenging, as we need to handle different settings such as closed-set, long-tail, few-shot, and open-set. By leveraging semantic knowledge from noisy text descriptions crawled from the Internet, we focus on the general video recognition (GVR) problem of solving different recognition tasks within a unified framework. The core contribution of this paper is twofold. First, we build a comprehensive video recognition benchmark, Kinetics-GVR, including four sub-task datasets that cover the above settings. To facilitate research on GVR, we propose to utilize external textual knowledge from the Internet and provide multi-source text descriptions for all action classes. Second, inspired by the flexibility of language representations, we present a unified visual-linguistic framework (VLG) that solves GVR with an effective two-stage training paradigm. VLG is first pre-trained on video and language datasets to learn a shared feature space, and then a flexible bi-modal attention head is devised to integrate high-level semantic concepts under the different settings. Extensive experiments show that VLG achieves state-of-the-art performance under all four settings, demonstrating the effectiveness and generalization ability of the proposed framework. We hope our work takes a step toward general video recognition and can serve as a baseline for future research. The code and models will be available at https://github.com/MCG-NJU/VLG.
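The abstract names a shared video-text feature space and a bi-modal attention head but gives no implementation detail. As a rough illustration only, the sketch below shows one plausible (assumed) form of such a head: per-frame visual features are attended by text embeddings of class descriptions, and the video is classified by cosine similarity in the shared space. All module and parameter names here are hypothetical and do not reflect the authors' actual code.

```python
# Hypothetical sketch (not the authors' implementation) of a bi-modal attention head:
# text embeddings of class descriptions act as queries over per-frame visual features,
# and classification is done by cosine similarity in a shared feature space.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiModalAttentionHead(nn.Module):
    """Cross-attention from class-description text embeddings to frame features."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / 0.07))  # learnable temperature

    def forward(self, frame_feats: torch.Tensor, class_text_embeds: torch.Tensor):
        # frame_feats:       (B, T, D)  per-frame visual features
        # class_text_embeds: (C, D)     one embedding per class description
        B = frame_feats.size(0)
        queries = class_text_embeds.unsqueeze(0).expand(B, -1, -1)   # (B, C, D)
        attended, _ = self.attn(queries, frame_feats, frame_feats)   # (B, C, D)
        video_embed = F.normalize(frame_feats.mean(dim=1), dim=-1)   # (B, D)
        class_embed = F.normalize(attended, dim=-1)                  # (B, C, D)
        # cosine similarity between the video embedding and each attended class embedding
        logits = torch.einsum("bd,bcd->bc", video_embed, class_embed) * self.logit_scale
        return logits


if __name__ == "__main__":
    head = BiModalAttentionHead(dim=512)
    frames = torch.randn(2, 8, 512)         # 2 videos, 8 frames each
    class_texts = torch.randn(400, 512)     # e.g. 400 Kinetics class descriptions
    print(head(frames, class_texts).shape)  # torch.Size([2, 400])
```

In a design like this the class descriptions only enter as queries, so swapping in text embeddings for unseen or rarely seen classes would let the same head serve open-set, few-shot, and long-tail settings without changing the visual backbone, which is consistent with the unified treatment the abstract describes.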


