UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

04/01/2021
by Mingyang Zhou, et al.

Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language. To generalize this success to non-English languages, we introduce UC2, the first machine-translation-augmented framework for cross-lingual cross-modal representation learning. To tackle the scarcity of multilingual captions for image datasets, we first augment existing English-only datasets with other languages via machine translation (MT). We then extend the standard Masked Language Modeling and Image-Text Matching training objectives to the multilingual setting, where alignment between different languages is captured through shared visual context (i.e., using the image as a pivot). To facilitate the learning of a joint embedding space of images and all languages of interest, we further propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), which leverage the MT-enhanced translated data. Evaluation on multilingual image-text retrieval and multilingual visual question answering benchmarks demonstrates that our framework achieves new state-of-the-art results on diverse non-English benchmarks while maintaining performance comparable to monolingual pre-trained models on English tasks.
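As a concrete illustration of the MT-augmentation step, the sketch below translates a batch of English captions into a target language with a MarianMT checkpoint from Hugging Face transformers. The abstract does not commit to any particular MT toolchain, so the model name and the helper function translate_captions are assumptions chosen purely for illustration.

# Hypothetical helper illustrating the MT-augmentation step; the paper
# does not specify this toolchain, so the MarianMT checkpoint below is
# an assumption used for illustration only.
from transformers import MarianMTModel, MarianTokenizer

def translate_captions(captions, model_name="Helsinki-NLP/opus-mt-en-de"):
    """Translate a batch of English captions into a target language."""
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tokenizer(captions, return_tensors="pt",
                      padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

# Example: pair each image's English caption with a German pseudo-caption.
en_captions = ["A dog catches a frisbee in the park."]
de_captions = translate_captions(en_captions)

Running this per target language turns each English-only image-caption pair into a small multilingual caption set sharing one image, which is what the multilingual pre-training objectives consume.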

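VTLM then feeds the model an image together with a caption and its translation, masking tokens on both sides so that recovery must rely on the shared visual context. The sketch below is a minimal schematic of that masking, assuming whitespace-tokenized captions; it masks each side independently, and the authors' actual implementation details (e.g., any handling of aligned word pairs across the two languages) are not reproduced here.

import random

MASK, IGNORE = "[MASK]", None  # IGNORE marks positions that are not predicted

def vtlm_mask(en_tokens, xx_tokens, mask_prob=0.15, rng=random):
    """Jointly mask an English caption and its machine translation.

    The transformer input would be [image regions] + en_masked + xx_masked,
    with the loss computed only at masked positions, so the model must use
    the image (and the surviving bilingual context) as a pivot."""
    def mask_seq(tokens):
        out, labels = list(tokens), [IGNORE] * len(tokens)
        for i, tok in enumerate(tokens):
            if rng.random() < mask_prob:
                out[i], labels[i] = MASK, tok
        return out, labels

    en_masked, en_labels = mask_seq(en_tokens)
    xx_masked, xx_labels = mask_seq(xx_tokens)
    return en_masked + xx_masked, en_labels + xx_labels

# Example with an English caption and its MT-generated German counterpart.
tokens, labels = vtlm_mask("a dog catches a frisbee".split(),
                           "ein Hund fängt eine Frisbee".split())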
