Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment

03/01/2022
by   Mingyang Zhou, et al.

Vision-and-Language (V+L) pre-training models have achieved tremendous success in recent years on various multi-modal benchmarks. However, the majority of existing models require pre-training on a large set of parallel image-text data, which is costly to collect compared to image-only or text-only data. In this paper, we explore unsupervised Vision-and-Language pre-training (UVLP) to learn cross-modal representations from non-parallel image and text datasets. We find two key factors that lead to good unsupervised V+L pre-training without parallel data: (i) joint image-and-text input and (ii) overall image-text alignment (even for non-parallel data). Accordingly, we propose a novel unsupervised V+L pre-training curriculum for non-parallel texts and images. We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks, including region-to-tag, region-to-phrase, and image-to-sentence alignment, to bridge the gap between the two modalities. A comprehensive ablation study shows that each granularity helps learn a stronger pre-trained model. We adapt our pre-trained model to a set of V+L downstream tasks, including VQA, NLVR2, Visual Entailment, and RefCOCO+. Our model achieves state-of-the-art performance on all these tasks under the unsupervised setting.
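The retrieval-based corpus construction described above can be illustrated with a minimal sketch: each unpaired image is represented by its detected object tags, and the sentence from an independent text corpus with the greatest tag overlap is retrieved as its weakly aligned caption. The function names and the simple word-overlap score below are illustrative assumptions, not the paper's exact retrieval method.

```python
# Hypothetical sketch of retrieval-based weak alignment (assumption:
# a simple tag/word-overlap score stands in for the paper's retrieval model).

def tag_overlap(tags, sentence):
    """Score a sentence by how many detected object tags it mentions."""
    words = set(sentence.lower().split())
    return sum(1 for t in tags if t in words)

def build_weakly_aligned_corpus(image_tags, sentences):
    """Pair each image (given as a list of detected object tags) with the
    best-matching sentence from an unpaired text corpus."""
    corpus = []
    for tags in image_tags:
        best = max(sentences, key=lambda s: tag_overlap(tags, s))
        corpus.append((tags, best))
    return corpus

images = [["dog", "frisbee"], ["car", "street"]]
texts = [
    "a dog catches a frisbee",
    "a car parked on the street",
    "two people talking",
]
pairs = build_weakly_aligned_corpus(images, texts)
```

The resulting weakly aligned pairs would then feed the multi-granular alignment tasks (region-to-tag, region-to-phrase, image-to-sentence); noise in the retrieved pairs is tolerable because the alignment objectives only require coarse correspondence.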


