Towards Versatile and Efficient Visual Knowledge Injection into Pre-trained Language Models with Cross-Modal Adapters

05/12/2023
by Xinyun Zhang, et al.

Humans learn language through multi-modal knowledge. Most existing pre-trained language models (PLMs), however, are pre-trained on text alone and are therefore cut off from multi-modal information. To inject visual knowledge into PLMs, existing methods incorporate either the text or the image encoder of vision-language models (VLMs) to encode the visual information, and they update all of the PLM's original parameters for knowledge fusion. In this paper, we propose a new plug-and-play module, X-adapter, that flexibly leverages the aligned visual and textual knowledge learned by pre-trained VLMs and efficiently injects it into PLMs. Specifically, we insert X-adapters into PLMs, and only the added parameters are updated during adaptation. To fully exploit the potential of VLMs, each X-adapter consists of two sub-modules, V-expert and T-expert, which fuse the VLM's image and text representations, respectively. Different sub-modules can be activated depending on the downstream task. Experimental results show that our method significantly improves performance on object-color reasoning and natural language understanding (NLU) tasks compared with PLM baselines.
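The abstract describes the architecture at a high level but gives no implementation details. Below is a minimal PyTorch sketch of what an X-adapter-style block could look like: an adapter bottleneck around the frozen PLM's hidden states, with two cross-attention "experts" that fuse pre-computed VLM image and text features. The bottleneck design, the cross-attention fusion, the head count, and all names (`XAdapter`, `v_expert`, `t_expert`, `vlm_dim`, `bottleneck_dim`) are illustrative assumptions, not the paper's confirmed design.

```python
import torch
import torch.nn as nn


class XAdapter(nn.Module):
    """Hypothetical sketch of an X-adapter block (not the paper's exact design).

    The abstract only states that X-adapters are inserted into a PLM, that
    only their parameters are trained, and that a V-expert and a T-expert
    fuse VLM image and text representations. Everything else here is assumed.
    """

    def __init__(self, hidden_dim: int, vlm_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        # Standard adapter-style bottleneck around the fusion step (assumed).
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()
        # V-expert / T-expert: cross-attention from PLM states to VLM features.
        self.v_expert = nn.MultiheadAttention(bottleneck_dim, num_heads=4, batch_first=True)
        self.t_expert = nn.MultiheadAttention(bottleneck_dim, num_heads=4, batch_first=True)
        # Project VLM features into the bottleneck dimension.
        self.v_proj = nn.Linear(vlm_dim, bottleneck_dim)
        self.t_proj = nn.Linear(vlm_dim, bottleneck_dim)

    def forward(self, hidden, vlm_image=None, vlm_text=None):
        # hidden:    (batch, seq_len, hidden_dim) output of a PLM layer
        # vlm_image: (batch, n_img_tokens, vlm_dim) VLM image features, optional
        # vlm_text:  (batch, n_txt_tokens, vlm_dim) VLM text features, optional
        x = self.act(self.down(hidden))
        # Activate only the expert(s) relevant to the downstream task;
        # with both inputs absent this degrades to a plain bottleneck adapter.
        if vlm_image is not None:
            kv = self.v_proj(vlm_image)
            x = x + self.v_expert(x, kv, kv)[0]
        if vlm_text is not None:
            kv = self.t_proj(vlm_text)
            x = x + self.t_expert(x, kv, kv)[0]
        # Residual connection keeps the frozen PLM's computation intact.
        return hidden + self.up(x)
```

In training, one would freeze the PLM (setting `requires_grad = False` on its original parameters) and optimize only the adapter weights, matching the abstract's claim that only the added parameters are updated during adaptation.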


Related research

06/20/2023 · MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models
Prompt tuning, like CoOp, has recently shown promising vision recognition...

03/14/2022 · Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer
Pre-trained language models are still far from human performance in task...

04/18/2022 · Imagination-Augmented Natural Language Understanding
Human brains integrate linguistic and perceptual information simultaneously...

05/08/2023 · A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues
Conditional inference on joint textual and visual clues is a multi-modal...

05/23/2023 · CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model
Pre-trained vision-language models are the de-facto foundation models for...

06/01/2023 · Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting
Pre-trained language models (PLMs) have played an increasing role in mul...

09/16/2023 · Delving into Multimodal Prompting for Fine-grained Visual Classification
Fine-grained visual classification (FGVC) involves categorizing fine sub...
