UNIMO-2: End-to-End Unified Vision-Language Grounded Learning

03/17/2022
by Wei Li, et al.

Vision-Language Pre-training (VLP) has achieved impressive performance on various cross-modal downstream tasks. However, most existing methods can only learn from aligned image-caption data and rely heavily on expensive regional features, which greatly limits their scalability and performance. In this paper, we propose an end-to-end unified-modal pre-training framework, namely UNIMO-2, for joint learning on both aligned image-caption data and unaligned image-only and text-only corpora. We build a unified Transformer model to jointly learn visual representations, textual representations, and the semantic alignment between images and texts. In particular, we propose to conduct grounded learning on both images and texts via a shared grounded space, which helps bridge unaligned images and texts and align the visual and textual semantic spaces across different types of corpora. The experiments show that our grounded learning method improves textual and visual semantic alignment, thereby improving performance on various cross-modal tasks. Moreover, benefiting from effective joint modeling of different types of corpora, our model also achieves impressive performance on single-modal visual and textual tasks. Our code and models are publicly available at the UNIMO project page https://unimo-ptm.github.io/.
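To make the idea of a shared grounded space concrete, below is a minimal sketch in PyTorch. It is an illustrative assumption, not the paper's implementation: it models the grounded space as a learnable dictionary of grounded tokens that both image and text features re-express themselves through via cross-attention, with an InfoNCE-style contrastive loss pulling paired image-caption embeddings together. All names, sizes, and the pooling strategy are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedGroundedSpace(nn.Module):
    """Hypothetical grounded space: a shared, learnable token dictionary
    that features from either modality attend to."""
    def __init__(self, dim=512, num_grounded_tokens=1024, num_heads=8):
        super().__init__()
        # Dictionary of grounded tokens shared by the image and text streams.
        self.grounded_tokens = nn.Parameter(torch.randn(num_grounded_tokens, dim))
        # Cross-attention: modality features (queries) attend to the dictionary.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, features):
        # features: (batch, seq_len, dim) from either an image or text encoder.
        tokens = self.grounded_tokens.unsqueeze(0).expand(features.size(0), -1, -1)
        grounded, _ = self.attn(query=features, key=tokens, value=tokens)
        return grounded  # same shape, re-expressed in the shared grounded space

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    # InfoNCE-style loss over pooled grounded embeddings of paired
    # image-caption data. Unaligned images/texts would pass through the
    # same grounded space but contribute only single-modal objectives
    # (not shown in this sketch).
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Usage sketch: pooled grounded features from both encoders are aligned.
space = SharedGroundedSpace()
img_feats = torch.randn(4, 49, 512)   # e.g. patch features from a visual encoder
txt_feats = torch.randn(4, 32, 512)   # e.g. token features from a text encoder
loss = contrastive_alignment_loss(space(img_feats).mean(1), space(txt_feats).mean(1))

Because both modalities attend to the same token dictionary, an unpaired image and an unpaired text are still embedded into a common space, which is what lets unaligned corpora contribute to cross-modal alignment.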

Related research

12/31/2020 · UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
Existing pre-training methods either focus on single-modal tasks or multi...

09/04/2021 · LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
Pre-training visual and textual representations from large-scale image-t...

10/06/2020 · Learning to Represent Image and Text with Denotation Graph
Learning to fuse vision and language information and representing them i...

04/11/2019 · Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations
We propose the Unified Visual-Semantic Embeddings (Unified VSE) for lear...

06/29/2023 · Unified Language Representation for Question Answering over Text, Tables, and Images
When trying to answer complex questions, people often rely on multiple s...

02/17/2023 · Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts
Medical vision-and-language pre-training (Med-VLP) has shown promising i...

08/08/2021 · OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning
We introduce the task of open-vocabulary visual instance search (OVIS). ...
