Cross-Modal Adapter for Text-Video Retrieval

11/17/2022
by Haojun Jiang et al.

Text-video retrieval is an important multi-modal learning task, where the goal is to retrieve the most relevant video for a given text query. Recently, pre-trained models such as CLIP have shown great potential on this task. However, as pre-trained models scale up, fully fine-tuning them on text-video retrieval datasets carries a high risk of overfitting. Moreover, in practice, it would be costly to train and store a large model for each task. To overcome these issues, we present a novel Cross-Modal Adapter for parameter-efficient fine-tuning. Inspired by adapter-based methods, we adjust the pre-trained model with only a few parameterized layers. However, there are two notable differences: first, our method is designed for the multi-modal domain; second, it allows early cross-modal interactions between CLIP's two encoders. Although surprisingly simple, our approach has three notable benefits: (1) it reduces the number of fine-tuned parameters by 99.6% and alleviates overfitting, (2) it saves approximately 30% of training time, and (3) it keeps all pre-trained parameters fixed, allowing the pre-trained model to be shared across datasets. Extensive experiments demonstrate that, without bells and whistles, our approach achieves performance superior or comparable to fully fine-tuned methods on the MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets. The code will be available at <https://github.com/LeapLabTHU/Cross-Modal-Adapter>.
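
The core idea described above is to keep both CLIP encoders frozen and train only small adapter layers that let the text and video branches interact. The PyTorch sketch below is a minimal, hypothetical illustration of such a bottleneck adapter with a shared down-projection across modalities; the module name, dimensions, zero-initialization, and the exact fusion mechanism are assumptions for illustration, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn


class CrossModalAdapter(nn.Module):
    """Minimal bottleneck-adapter sketch (assumed design, not the paper's exact one).

    A shared down-projection lets the text and video branches interact through a
    common low-dimensional space, while small modality-specific up-projections
    keep the number of trainable parameters tiny.
    """

    def __init__(self, dim: int = 512, bottleneck: int = 64):
        super().__init__()
        # Shared across modalities: assumed mechanism for early cross-modal interaction.
        self.shared_down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        # Modality-specific up-projections, zero-initialized so that training
        # starts from the frozen pre-trained model's behaviour.
        self.text_up = nn.Linear(bottleneck, dim)
        self.video_up = nn.Linear(bottleneck, dim)
        for layer in (self.text_up, self.video_up):
            nn.init.zeros_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, text_feat: torch.Tensor, video_feat: torch.Tensor):
        # Residual bottleneck update for each branch; only adapter weights are trained.
        text_out = text_feat + self.text_up(self.act(self.shared_down(text_feat)))
        video_out = video_feat + self.video_up(self.act(self.shared_down(video_feat)))
        return text_out, video_out


if __name__ == "__main__":
    # Usage sketch: CLIP itself stays frozen; only adapters like this are optimized.
    adapter = CrossModalAdapter(dim=512, bottleneck=64)
    text = torch.randn(8, 512)   # pooled features from the frozen text encoder (assumed)
    video = torch.randn(8, 512)  # pooled features from the frozen visual encoder (assumed)
    t, v = adapter(text, video)
    print(t.shape, v.shape)      # torch.Size([8, 512]) torch.Size([8, 512])
```

Because the pre-trained weights never change, one copy of CLIP can serve every dataset, with only the tiny adapter weights stored per task.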

Related research

- 09/04/2023 · MultiWay-Adapter: Adapting Large-Scale Multi-Modal Models for Scalable Image-Text Retrieval. As the size of Large Multi-Modal Models (LMMs) increases consistently, t...
- 11/23/2022 · VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval. Many recent studies leverage the pre-trained CLIP for text-video cross-m...
- 06/15/2023 · Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models. Video Question Answering (VideoQA) has been significantly advanced from ...
- 01/19/2023 · Multimodal Video Adapter for Parameter Efficient Video Text Retrieval. State-of-the-art video-text retrieval (VTR) methods usually fully fine-t...
- 01/08/2022 · A Comprehensive Empirical Study of Vision-Language Pre-trained Model for Supervised Cross-Modal Retrieval. Cross-Modal Retrieval (CMR) is an important research topic across multim...
- 07/21/2023 · Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation. Parameter Efficient Tuning (PET) has gained attention for reducing the n...
- 06/28/2023 · ICSVR: Investigating Compositional and Semantic Understanding in Video Retrieval Models. Video retrieval (VR) involves retrieving the ground truth video from the...
