
Prompting Visual-Language Models for Efficient Video Understanding

by Chen Ju, et al.
University of Oxford
Shanghai Jiao Tong University

Visual-language pre-training has shown great success in learning joint visual-textual representations from large-scale web data, demonstrating a remarkable ability for zero-shot generalisation. This paper presents a simple method to efficiently adapt a single pre-trained visual-language model to novel tasks with minimal training; here, we consider video understanding tasks. Specifically, we propose to optimise a few random vectors, termed continuous prompt vectors, that convert the novel tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components and their necessity. On 9 public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and open-set scenarios, we achieve competitive or state-of-the-art performance relative to existing methods, despite training significantly fewer parameters.
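
The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the frozen text and image encoders are replaced by a simple mean-pooling stand-in, the temporal Transformer is replaced by averaging over frames, and no training loop is shown. All names (`DIM`, `mean_pool`, `build_text_query`, `classify`) are hypothetical and chosen for illustration only. What the sketch does show is the idea that a handful of prompt vectors are the only task-specific parameters, and that classification reduces to matching a video embedding against prompted class-name embeddings, as in the pre-training objective.

```python
import math
import random

DIM = 8  # hypothetical embedding width; real models use far larger widths


def mean_pool(vectors):
    """Stand-in for a frozen encoder: average a sequence of vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(DIM)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_text_query(prefix, class_token, suffix):
    """Wrap a class-name embedding with prompt vectors, then encode it.

    `prefix` and `suffix` play the role of the continuous prompt vectors;
    in the paper's setting they would be the parameters being optimised.
    """
    return mean_pool(prefix + [class_token] + suffix)


def classify(frame_features, class_tokens, prefix, suffix):
    """Return the best-matching class name and all similarity scores."""
    # Assumption: averaging over frames replaces the temporal Transformer.
    video = mean_pool(frame_features)
    scores = {
        name: cosine(video, build_text_query(prefix, token, suffix))
        for name, token in class_tokens.items()
    }
    return max(scores, key=scores.get), scores


def random_vector(rng):
    """Draw one randomly initialised vector."""
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


rng = random.Random(0)
prefix = [random_vector(rng) for _ in range(2)]  # prompt vectors before the name
suffix = [random_vector(rng) for _ in range(2)]  # prompt vectors after the name
class_tokens = {"archery": random_vector(rng), "bowling": random_vector(rng)}
frames = [random_vector(rng) for _ in range(4)]  # frame-wise visual features

label, scores = classify(frames, class_tokens, prefix, suffix)
trainable = (len(prefix) + len(suffix)) * DIM
print(label, scores, trainable)
```

In this toy setting only the `prefix` and `suffix` vectors would receive gradients, which is why the number of trained parameters stays small compared with fine-tuning the full encoders.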

