ERNIE-UniX2: A Unified Cross-lingual Cross-modal Framework for Understanding and Generation

11/09/2022
by Bin Shan, et al.

Recent cross-lingual cross-modal works attempt to extend Vision-Language Pre-training (VLP) models to non-English inputs and achieve impressive performance. However, these models focus only on understanding tasks, using encoder-only architectures. In this paper, we propose ERNIE-UniX2, a unified cross-lingual cross-modal pre-training framework for both generation and understanding tasks. ERNIE-UniX2 integrates multiple pre-training paradigms (e.g., contrastive learning and language modeling) on an encoder-decoder architecture and attempts to learn a better joint representation across languages and modalities. Furthermore, ERNIE-UniX2 can be seamlessly fine-tuned for a variety of downstream generation and understanding tasks. Pre-trained on both multilingual text-only and image-text datasets, ERNIE-UniX2 achieves SOTA results on various cross-lingual cross-modal generation and understanding tasks such as multimodal machine translation and multilingual visual question answering.
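The abstract describes combining a contrastive pre-training objective with a language-modeling objective on one encoder-decoder model. The paper's actual losses and architecture are not given here, so the sketch below is only an illustrative NumPy toy: a symmetric image-text InfoNCE contrastive loss plus a token-level cross-entropy generation loss, summed into a single training objective. All names (`info_nce_loss`, `lm_loss`, the embedding and logit shapes) are assumptions for illustration, not the ERNIE-UniX2 implementation.

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive (InfoNCE) loss over a batch.
    Matching image/text pairs sit on the diagonal of the similarity matrix.
    NOTE: illustrative stand-in, not the paper's exact contrastive loss."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (B, B) cosine similarities
    labels = np.arange(len(logits))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()     # pick the diagonal (positives)

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

def lm_loss(decoder_logits, target_ids):
    """Token-level cross-entropy for the decoder (language-modeling objective)."""
    l = decoder_logits - decoder_logits.max(axis=-1, keepdims=True)
    logp = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
    flat = logp.reshape(-1, logp.shape[-1])     # (B*seq, vocab)
    rows = np.arange(target_ids.size)
    return -flat[rows, target_ids.ravel()].mean()

# Toy batch: 4 image/text embedding pairs and decoder outputs (made-up shapes).
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 16))
txt_emb = img_emb + 0.1 * rng.normal(size=(4, 16))   # near-matching pairs
logits = rng.normal(size=(4, 8, 100))                # (batch, seq_len, vocab)
targets = rng.integers(0, 100, size=(4, 8))

# A unified framework would backpropagate through the sum of both objectives.
total = info_nce_loss(img_emb, txt_emb) + lm_loss(logits, targets)
```

Summing (possibly weighted) per-paradigm losses like this is the usual way multiple pre-training objectives are combined on a shared backbone; the specific weighting and batching used by ERNIE-UniX2 would come from the full paper.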

Related research

- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training (04/01/2021)
  Vision-and-language pre-training has achieved impressive success in lear...

- RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training (05/13/2023)
  Multilingual vision-language (V&L) pre-training has achieved remarkabl...

- Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training (06/01/2022)
  In this paper, we introduce Cross-View Language Modeling, a simple and e...

- VECO: Variable Encoder-decoder Pre-training for Cross-lingual Understanding and Generation (10/30/2020)
  Recent studies about learning multilingual representations have achieved...

- mSLAM: Massively multilingual joint pre-training for speech and text (02/03/2022)
  We present mSLAM, a multilingual Speech and LAnguage Model that learns c...

- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training (08/16/2019)
  We propose Unicoder-VL, a universal encoder that aims to learn joint rep...

- Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training (08/25/2021)
  Translating e-commercial product descriptions, a.k.a product-oriented ma...
