Retrieval-based Knowledge Augmented Vision Language Pre-training

04/27/2023
by Jiahua Rao, et al.

With recent progress in large-scale vision and language representation learning, Vision Language Pre-training (VLP) models have achieved promising improvements on various multi-modal downstream tasks. Although powerful, these pre-training models still do not take advantage of world knowledge, which is only implicit in multi-modal data yet offers abundant, complementary information. In this work, we propose a REtrieval-based knowledge Augmented Vision Language Pre-training model (REAVL), which retrieves world knowledge from knowledge graphs (KGs) and incorporates it into vision-language pre-training. REAVL has two core components: a knowledge retriever that retrieves knowledge given multi-modal data, and a knowledge-augmented model that fuses the multi-modal data with the retrieved knowledge. By unifying four knowledge-aware self-supervised tasks, REAVL promotes the mutual integration of multi-modal data and knowledge, fusing explicit knowledge with vision-language pairs for masked multi-modal data modeling and KG relational reasoning. Empirical experiments show that REAVL achieves new state-of-the-art performance uniformly on knowledge-based vision-language understanding and multimodal entity linking tasks, and competitive results on general vision-language tasks while using only 0.2% of the pre-training data of the best models.
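The abstract describes a retrieve-then-fuse architecture: a retriever selects relevant KG knowledge for each image-text pair, and a fusion model combines all three modalities under self-supervised objectives. The sketch below illustrates that general pattern at toy scale with a dense dot-product retriever over precomputed triple embeddings, a shared Transformer fusing vision, text, and knowledge tokens, and a single masked-text loss. All module names, dimensions, and design choices here are illustrative assumptions, not REAVL's published implementation, which unifies four knowledge-aware objectives rather than the one shown.

```python
# Minimal sketch of the retrieve-then-fuse pattern described in the abstract.
# Module names (KnowledgeRetriever, KnowledgeAugmentedFusion), sizes, and the
# dot-product retriever are hypothetical stand-ins, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeRetriever(nn.Module):
    """Scores precomputed KG triple embeddings against a query built from
    the image-text pair and returns the top-k matches (dense retrieval)."""
    def __init__(self, dim, kg_embeddings):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)
        # One embedding per KG triple: (num_triples, dim). Assumed precomputed.
        self.register_buffer("kg_embeddings", kg_embeddings)

    def forward(self, fused_query, k=4):
        q = self.query_proj(fused_query)                      # (B, dim)
        scores = q @ self.kg_embeddings.T                     # (B, num_triples)
        topk = scores.topk(k, dim=-1)
        return self.kg_embeddings[topk.indices], topk.values  # (B, k, dim), (B, k)

class KnowledgeAugmentedFusion(nn.Module):
    """Fuses vision, language, and retrieved knowledge tokens with one
    shared Transformer encoder, standing in for the knowledge-augmented model."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, vision_tokens, text_tokens, knowledge_tokens):
        tokens = torch.cat([vision_tokens, text_tokens, knowledge_tokens], dim=1)
        return self.encoder(tokens)

# Toy usage: one pre-training step with a masked-text objective only.
B, Tv, Tt, dim, vocab = 2, 9, 12, 256, 1000
retriever = KnowledgeRetriever(dim, torch.randn(500, dim))
fusion = KnowledgeAugmentedFusion(dim)
mlm_head = nn.Linear(dim, vocab)

vision_tokens = torch.randn(B, Tv, dim)   # e.g. ViT patch features
text_tokens = torch.randn(B, Tt, dim)     # embedded, partially masked caption
query = vision_tokens.mean(1) + text_tokens.mean(1)

knowledge_tokens, _ = retriever(query, k=4)
hidden = fusion(vision_tokens, text_tokens, knowledge_tokens)
text_hidden = hidden[:, Tv:Tv + Tt]       # positions of the text tokens

labels = torch.randint(vocab, (B, Tt))    # ids of the masked-out tokens
loss = F.cross_entropy(mlm_head(text_hidden).reshape(-1, vocab),
                       labels.reshape(-1))
loss.backward()
```

In a full system the knowledge tokens would also feed KG relational-reasoning and masked multi-modal losses alongside the text loss shown here.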

Related research

10/17/2022 · Contrastive Language-Image Pre-Training with Knowledge Graphs
Recent years have witnessed the fast development of large-scale pre-trai...

06/16/2022 · MixGen: A New Multi-Modal Data Augmentation
Data augmentation is a necessity to enhance data efficiency in deep lear...

11/17/2022 · Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
To effectively exploit the potential of large-scale models, various pre-...

08/23/2022 · Learning More May Not Be Better: Knowledge Transferability in Vision and Language Tasks
Is more data always better to train vision-and-language models? We study...

12/08/2020 · LAMP: Label Augmented Multimodal Pretraining
Multi-modal representation learning by pretraining has become an increas...

08/03/2022 · GPPF: A General Perception Pre-training Framework via Sparsely Activated Multi-Task Learning
Pre-training over mixtured multi-task, multi-domain, and multi-modal dat...

08/20/2020 · VisualSem: a high-quality knowledge graph for vision and language
We argue that the next frontier in natural language understanding (NLU) ...
