VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval

11/23/2022
by Siteng Huang, et al.

Many recent studies leverage the pre-trained CLIP for text-video cross-modal retrieval by tuning the backbone with additional heavy modules, which not only brings a large computational burden from many more parameters, but also causes knowledge from the upstream model to be forgotten. In this work, we propose VoP: Text-Video Co-operative Prompt Tuning for efficient tuning on the text-video retrieval task. The proposed VoP is an end-to-end framework that introduces both video and text prompts, and it can be regarded as a powerful baseline with only 0.1% trainable parameters. Further, based on the spatio-temporal characteristics of videos, we develop three novel video prompt mechanisms to improve performance at different scales of trainable parameters. The basic idea of the VoP enhancement is to model the frame position, frame context, and layer function with specific trainable prompts, respectively. Extensive experiments show that, compared to full fine-tuning, the enhanced VoP achieves a 1.4% average R@1 gain across five text-video retrieval benchmarks with 6x less parameter overhead. The code will be available at https://github.com/bighuang624/VoP.
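
To make the general idea of prompt tuning concrete, the following is a toy sketch in plain Python, not the authors' implementation. It assumes that a small set of trainable prompt vectors is prepended to the token sequence at every encoder layer while the backbone stays frozen. The function names, the layer count, the prompt length, the embedding size, and the backbone parameter count are all illustrative assumptions and are not taken from the paper.

```python
import random


def make_prompts(num_layers, prompts_per_layer, dim, seed=0):
    """Create one set of prompt vectors per encoder layer (assumed layout)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(prompts_per_layer)]
        for _ in range(num_layers)
    ]


def prepend_prompts(tokens, layer_prompts):
    """Build a layer input: prompt vectors first, then the original tokens."""
    return layer_prompts + tokens


def count_params(prompts):
    """Count the scalar values held in the nested prompt lists."""
    return sum(len(vec) for layer in prompts for vec in layer)


def trainable_fraction(num_prompt_params, num_frozen_params):
    """Share of all parameters that would be updated during tuning."""
    return num_prompt_params / (num_prompt_params + num_frozen_params)


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    prompts = make_prompts(num_layers=12, prompts_per_layer=8, dim=512)
    frozen_backbone_params = 150_000_000  # assumed, not a figure from the paper

    frame_tokens = [[0.0] * 512 for _ in range(50)]  # stand-in for frame patch tokens
    layer_input = prepend_prompts(frame_tokens, prompts[0])

    print("tokens per layer input:", len(layer_input))
    print("prompt parameters:", count_params(prompts))
    print("trainable fraction:", trainable_fraction(count_params(prompts), frozen_backbone_params))
```

Only the prompt vectors would receive gradient updates in such a setup, which is why the trainable share stays tiny compared with full fine-tuning.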
