Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts

11/16/2021
by   Yan Zeng, et al.

Most existing vision-language pre-training methods rely on object-centric features extracted through object detection and make fine-grained alignments between the extracted features and texts. We argue that object detection may be ill-suited for vision-language pre-training. Instead, the task should locate in the image the regions of the `visual concepts' mentioned in the text and, at the same time, identify alignments between texts and visual concepts, where the alignments are multi-grained. This paper proposes a new method, X-VLM, to perform `multi-grained vision language pre-training'. Experimental results show that X-VLM consistently outperforms state-of-the-art methods on many downstream vision-language tasks.
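The alignment objective described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each visual concept (object, region, or full image) and each text span (word, phrase, or sentence) has already been encoded into a vector, and shows only an InfoNCE-style contrastive loss summed over granularity levels; the function names and the two-dimensional toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(visual, textual, temperature=0.1):
    """InfoNCE-style loss: the i-th visual concept should score highest
    against its paired i-th text span, relative to all other spans."""
    n = len(visual)
    loss = 0.0
    for i in range(n):
        logits = [cosine(visual[i], textual[j]) / temperature for j in range(n)]
        m = max(logits)  # stabilize the log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_z)
    return loss / n

def multi_grained_loss(pairs_by_granularity):
    """Sum the same contrastive loss over each granularity level,
    e.g. (objects, words), (regions, phrases), (image, sentence)."""
    return sum(contrastive_loss(v, t) for v, t in pairs_by_granularity)
```

With correctly paired toy embeddings the loss is near zero, while swapping the text embeddings (so each visual concept faces the wrong caption) drives it up, which is the behavior the multi-grained alignment objective rewards.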


Related research

- 08/04/2022, Fine-Grained Semantically Aligned Vision-Language Pre-Training. Large-scale vision-language pre-training has shown impressive advances i...
- 01/02/2021, VinVL: Making Visual Representations Matter in Vision-Language Models. This paper presents a detailed study of improving visual representations...
- 07/12/2022, IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training. Vision-Language Pre-training (VLP) with large-scale image-text pairs has...
- 08/07/2023, FeatEnHancer: Enhancing Hierarchical Features for Object Detection and Beyond Under Low-Light Vision. Extracting useful visual cues for the downstream tasks is especially cha...
- 04/12/2023, CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes. Training models to apply linguistic knowledge and visual concepts from 2...
- 01/05/2023, GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods. A key goal for the advancement of AI is to develop technologies that ser...
- 05/30/2022, VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models. Recent advances in vision-language pre-training (VLP) have demonstrated ...
