Mode Approximation Makes Good Vision-Language Prompts

05/15/2023
by Haixin Wang, et al.

With the advance of large-scale model technologies, parameter-efficient transfer learning (PETL) has swept across various fields of Artificial Intelligence. Its core idea is to adapt a model to downstream tasks using only a small number of trainable parameters. Recently, some studies have applied these proven techniques to multimodal tasks. However, two critical issues remain unresolved: how to further reduce complexity with a lightweight design, and how to improve alignment between modalities under extremely low parameter budgets. In this paper, we propose a graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges. Considering the redundancy in existing architectures, we first use mode approximation to generate the few trainable parameters needed for multimodal prompt tuning, exploiting the low intrinsic dimension with only about 0.05% of the model's parameters. Then, to better narrow the modality gap, we propose informative context enhancement and gated query transformation modules for extremely low-parameter regimes. A thorough evaluation of Aurora on six cross-modal downstream benchmarks shows that it not only outperforms the state of the art but even surpasses full fine-tuning. Our code is available at: https://github.com/WillDreamer/Aurora.
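As a rough intuition for how mode approximation can shrink the trainable parameter count, the following is a minimal sketch (not the authors' implementation) of a CP-style factorization of a weight-update tensor: instead of learning a dense update, one learns a few shared rank-one factors. All shapes and variable names here are illustrative assumptions.

```python
import numpy as np

# Hypothetical shapes: d = hidden size, m = number of adapted weight
# matrices shared across modalities, R = CP rank (all illustrative).
d, m, R = 64, 6, 4
rng = np.random.default_rng(0)

# Trainable CP factors (the only parameters that would be updated).
lam = rng.standard_normal(R)        # per-rank scaling coefficients
U = rng.standard_normal((d, R))     # mode-1 factor
V = rng.standard_normal((d, R))     # mode-2 factor
P = rng.standard_normal((m, R))     # mode-3 factor, shared across layers

# Reconstruct the full weight-update tensor from the rank-R factors:
# delta_W[i, j, k] = sum_r lam[r] * U[i, r] * V[j, r] * P[k, r]
delta_W = np.einsum('r,ir,jr,kr->ijk', lam, U, V, P)

full_params = d * d * m              # parameters of a dense update
cp_params = R * (1 + d + d + m)      # parameters of the factored update
print(delta_W.shape, full_params, cp_params)
```

With these toy shapes the factored update needs 540 parameters versus 24,576 for the dense one, which illustrates why such low-rank mode approximations are attractive for extremely low-parameter tuning.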

