Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

04/07/2021
by Zhicheng Huang, et al.

We study joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT), which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step-by-step. As region-based visual features usually represent only parts of an image, it is challenging for existing vision-language models to fully understand the semantics of the paired natural language. In this paper, we propose SOHO to "See Out of tHe bOx": it takes a whole image as input and learns vision-language representations in an end-to-end manner. SOHO does not require bounding box annotations, which enables inference 10 times faster than region-based approaches. In particular, SOHO learns to extract comprehensive yet compact image features through a visual dictionary (VD) that facilitates cross-modal understanding. The VD is designed to represent consistent visual abstractions of similar semantics. It is updated on-the-fly and utilized in our proposed pre-training task, Masked Visual Modeling (MVM). We conduct experiments on four well-established vision-language tasks following standard VLPT settings. In particular, SOHO achieves absolute gains of 2.0% on the retrieval 5k test split and 1.5% on the SNLI-VE test split.
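To make the visual dictionary concrete, here is a minimal PyTorch sketch of the idea as the abstract describes it: a codebook that quantizes CNN grid features to their nearest visual "word" and refreshes the selected codes with a moving average, so the dictionary is updated on-the-fly during pre-training. The class name, codebook size, and momentum value are illustrative assumptions, not the authors' released implementation.

import torch

class VisualDictionary(torch.nn.Module):
    # Illustrative visual dictionary (VD): quantizes dense grid features
    # to their nearest codebook entries and updates the codebook
    # on-the-fly with a moving average. Hyperparameters are hypothetical.
    def __init__(self, num_codes=2048, dim=768, momentum=0.99):
        super().__init__()
        self.momentum = momentum
        # Codebook of visual "words"; a buffer, updated without gradients.
        self.register_buffer("codebook", torch.randn(num_codes, dim))

    @torch.no_grad()
    def _update(self, feats, idx):
        # Moving-average update of the codes selected in this batch.
        for i in idx.unique():
            batch_mean = feats[idx == i].mean(dim=0)
            self.codebook[i] = (self.momentum * self.codebook[i]
                                + (1 - self.momentum) * batch_mean)

    def forward(self, feats):
        # feats: (N, dim) flattened CNN grid features for a batch of images.
        dists = torch.cdist(feats, self.codebook)  # (N, num_codes)
        idx = dists.argmin(dim=1)                  # nearest visual word per feature
        quantized = self.codebook[idx]
        if self.training:
            self._update(feats.detach(), idx)
        # Straight-through estimator: the forward pass uses the quantized
        # embedding while gradients flow back to the CNN features.
        return feats + (quantized - feats).detach(), idx

Under this reading, the returned indices can serve as classification targets for Masked Visual Modeling: mask some grid features and train the Transformer to predict the VD index of each masked position.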

Related research

06/03/2021  E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning
Vision-language pre-training (VLP) on large-scale image-text pairs has a...

04/13/2020  Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Large-scale pre-training methods of learning cross-modal representations...

08/16/2019  Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
We propose Unicoder-VL, a universal encoder that aims to learn joint rep...

08/21/2021  Grid-VLP: Revisiting Grid Features for Vision-Language Pre-training
Existing approaches to vision-language pre-training (VLP) heavily rely o...

01/18/2023  Effective End-to-End Vision Language Pretraining with Semantic Visual Loss
Current vision language pretraining models are dominated by methods usin...

03/10/2022  Cross-modal Map Learning for Vision and Language Navigation
We consider the problem of Vision-and-Language Navigation (VLN). The maj...

09/01/2023  Learned Visual Features to Textual Explanations
Interpreting the learned features of vision models has posed a longstand...
