MultiWay-Adapter: Adapting large-scale multi-modal models for scalable image-text retrieval

09/04/2023
by Zijun Long, et al.

As Large Multi-Modal Models (LMMs) continue to grow in size, adapting these pre-trained models to specialized tasks has become a computationally and memory-intensive challenge. Traditional fine-tuning methods require isolated, exhaustive retuning for each new task, limiting the models' versatility. Moreover, current efficient adaptation techniques often overlook modality alignment, focusing only on extracting knowledge for new tasks. To tackle these issues, we introduce MultiWay-Adapter, an innovative framework incorporating an 'Alignment Enhancer' to deepen modality alignment, enabling high transferability without tuning pre-trained parameters. Our method adds fewer than 1.25% additional parameters to LMMs, exemplified by the BEiT-3 model in our study. This leads to superior zero-shot image-text retrieval performance compared to fully fine-tuned models, while achieving up to a 57% reduction in fine-tuning time. Our approach offers a resource-efficient and effective adaptation pathway for LMMs, broadening their applicability. The source code is publicly available at: <https://github.com/longkukuhi/MultiWay-Adapter>.
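To make the parameter-efficiency claim concrete, below is a minimal sketch (not the authors' code) of the kind of frozen-backbone bottleneck adapter that methods like MultiWay-Adapter build on: a small down-projection/up-projection pair added around frozen pre-trained layers, so only the tiny new matrices are trained. The hidden size, bottleneck width, and zero-initialization are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical sketch of a residual bottleneck adapter. The pre-trained
# transformer weights would stay frozen; only W_down and W_up are trained.
hidden = 768      # assumed transformer hidden size
bottleneck = 48   # assumed adapter bottleneck width

rng = np.random.default_rng(0)
W_down = rng.normal(0.0, 0.02, (hidden, bottleneck))
W_up = np.zeros((bottleneck, hidden))  # zero-init: adapter starts as identity

def adapter(x):
    # residual bottleneck: x + up(relu(down(x)))
    return x + np.maximum(x @ W_down, 0.0) @ W_up

x = rng.normal(size=(4, hidden))
y = adapter(x)
assert np.allclose(y, x)  # identity at initialization, by construction

# Adapter parameters vs. a rough count for one frozen transformer layer
# (~4*hidden^2 for attention projections + ~8*hidden^2 for the FFN).
adapter_params = W_down.size + W_up.size
layer_params = 12 * hidden * hidden
print(f"adapter adds {adapter_params / layer_params:.2%} of one layer's parameters")
```

With these illustrative sizes the adapter adds on the order of 1% of a layer's parameters, in the same regime as the sub-1.25% overhead the abstract reports; the actual MultiWay-Adapter additionally includes the Alignment Enhancer described in the paper.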

