
UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

12/31/2020
by   Wei Li, et al.

Existing pre-training methods focus on either single-modal tasks or multi-modal tasks, and cannot effectively adapt to each other. They can only utilize single-modal data (i.e., text or image) or limited multi-modal data (i.e., image-text pairs). In this work, we propose a unified-modal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large-scale free text corpora and image collections can be utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information into a unified semantic space over a corpus of image-text pairs. As non-paired single-modal data is very rich, our model can utilize a much larger scale of data to learn more generalizable representations. Moreover, textual knowledge and visual knowledge can enhance each other in the unified semantic space. The experimental results show that UNIMO significantly improves the performance of several single-modal and multi-modal downstream tasks.
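The abstract does not spell out the CMCL objective, so the sketch below shows only a generic image-text contrastive (InfoNCE-style) loss in PyTorch to illustrate the idea of aligning two modalities in a shared embedding space. The function name, arguments, and temperature value are illustrative assumptions, not the paper's implementation.

```python
# Minimal, illustrative image-text contrastive loss (InfoNCE-style).
# This is an assumed sketch, not UNIMO's actual CMCL objective.
import torch
import torch.nn.functional as F


def cross_modal_contrastive_loss(image_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of aligned image-text pairs.

    image_emb, text_emb: (batch, dim) outputs of the visual and textual
    encoders; the i-th image and i-th text are assumed to be a matching pair.
    """
    # Normalize so that dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix; the diagonal holds the positive pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Toy usage with random embeddings standing in for encoder outputs.
    imgs = torch.randn(8, 256)
    txts = torch.randn(8, 256)
    print(cross_modal_contrastive_loss(imgs, txts).item())
```

Treating matched pairs as positives and all other in-batch combinations as negatives is the standard way such a loss pulls the two modalities into a unified semantic space.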

OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation (07/01/2021)
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross...

VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix (06/17/2022)
Existing vision-language pre-training (VLP) methods primarily rely on pa...

UNIMO-2: End-to-End Unified Vision-Language Grounded Learning (03/17/2022)
Vision-Language Pre-training (VLP) has achieved impressive performance o...

Contrastive Language-Image Pre-Training with Knowledge Graphs (10/17/2022)
Recent years have witnessed the fast development of large-scale pre-trai...

A Unified Continuous Learning Framework for Multi-modal Knowledge Discovery and Pre-training (06/11/2022)
Multi-modal pre-training and knowledge discovery are two important resea...

Design of the topology for contrastive visual-textual alignment (09/05/2022)
Pre-training weakly related image-text pairs in the contrastive style sh...

EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling (09/10/2021)
While large scale pre-training has achieved great achievements in bridgi...

Code Repositories

UNIMO

UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning
