Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

08/16/2019
by   Gen Li, et al.
7

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrow ideas from cross-lingual pre-trained models, such as XLM and Unicoder, both visual and linguistic contents are fed into a multi-layer transformer for the cross-modal pre-training, where three pre-trained tasks are employed, including masked language model, masked object label prediction and visual-linguistic matching. The first two tasks learn context-aware representations for input tokens based on linguistic and visual contents jointly. The last task tries to predict whether an image and a text describe each other. After pretraining on large amounts of image-caption pairs, we transfer Unicoder-VL to image-text retrieval tasks with just one additional output layer, and achieve state-of-the-art performances on both MSCOCO and Flicker30K.

READ FULL TEXT
research
11/09/2022

ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation

Recent cross-lingual cross-modal works attempt to extend Vision-Language...
research
01/07/2023

Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching

Despite surprising performance on zero-shot transfer, pre-training a lar...
research
09/17/2020

Cross-Modal Alignment with Mixture Experts Neural Network for Intral-City Retail Recommendation

In this paper, we introduce Cross-modal Alignment with mixture experts N...
research
05/15/2020

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

Recent Transformer-based large-scale pre-trained models have revolutioni...
research
07/30/2021

Structural Guidance for Transformer Language Models

Transformer-based language models pre-trained on large amounts of text d...
research
04/07/2021

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

We study joint learning of Convolutional Neural Network (CNN) and Transf...
research
12/06/2021

General Facial Representation Learning in a Visual-Linguistic Manner

How to learn a universal facial representation that boosts all face anal...

Please sign up or login with your details

Forgot password? Click here to reset