Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training

08/21/2021
by Ming Yan, et al.

Existing approaches to vision-language pre-training (VLP) rely heavily on an object detector based on bounding boxes (regions): salient objects are first detected in an image, and a Transformer-based model then performs cross-modal fusion. Despite their superior performance, these approaches are bounded by the capability of the object detector in both effectiveness and efficiency. In addition, the object detection step imposes unnecessary constraints on model design and makes end-to-end training difficult. In this paper, we revisit grid-based convolutional features for vision-language pre-training, skipping the expensive region-related steps. We propose a simple yet effective grid-based VLP method that works surprisingly well with grid features. Pre-trained only on in-domain datasets, the proposed Grid-VLP method outperforms most competitive region-based VLP methods on the three vision-language understanding tasks examined. We hope that our findings help to advance the state of the art in vision-language pre-training and provide a new direction towards effective and efficient VLP.
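
To make the idea of grid features concrete, the following is a minimal sketch, not the authors' implementation. It assumes a convolutional backbone has already produced a C x H x W feature map, and it treats every spatial cell as one visual token that can be concatenated with text tokens before cross-modal fusion. The function names `grid_to_tokens` and `build_fusion_input`, the toy values, and the plain-list representation are all hypothetical and chosen only for illustration. A real system would use a tensor library, a trained backbone, and typically a learned projection to the Transformer's embedding size.

```python
from typing import List

Vector = List[float]


def grid_to_tokens(feature_map: List[List[List[float]]]) -> List[Vector]:
    """Flatten a C x H x W feature map into H*W visual tokens of size C.

    No object detector or bounding box is involved: every grid cell
    becomes one token, read in row-major order.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    tokens: List[Vector] = []
    for row in range(height):
        for col in range(width):
            # Collect the values of all channels at this spatial position.
            tokens.append([feature_map[c][row][col] for c in range(channels)])
    return tokens


def build_fusion_input(text_tokens: List[Vector],
                       visual_tokens: List[Vector]) -> List[Vector]:
    """Concatenate text and visual tokens into one sequence.

    Assumption: both token types already share one embedding size.
    """
    sizes = {len(token) for token in text_tokens + visual_tokens}
    if len(sizes) != 1:
        raise ValueError("text and visual tokens must share one embedding size")
    return text_tokens + visual_tokens


# Toy feature map: 2 channels over a 2 x 3 grid (hypothetical values).
feature_map = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]],
]
visual_tokens = grid_to_tokens(feature_map)

# Two toy text embeddings with the same size as the visual tokens.
text_tokens = [[0.1, 0.2], [0.3, 0.4]]
sequence = build_fusion_input(text_tokens, visual_tokens)

print(len(visual_tokens), len(sequence))
```

In this sketch the number of visual tokens is fixed by the grid size (H x W) rather than by how many objects a detector finds, which is one reason a grid-based pipeline can be trained end to end.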


