ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora

12/31/2020
by Xuan Ouyang, et al.

Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance on downstream cross-lingual tasks. This improvement stems from training on large amounts of monolingual and parallel corpora. While it is generally acknowledged that parallel corpora are critical for improving model performance, existing methods are often constrained by the size of available parallel corpora, especially for low-resource languages. In this paper, we propose ERNIE-M, a new training method that encourages the model to align the representations of multiple languages using monolingual corpora, so that model performance is no longer bound by the size of parallel corpora. Our key insight is to integrate the idea of back translation into the pre-training process. We generate pseudo-parallel sentence pairs from a monolingual corpus to enable the learning of semantic alignment between different languages, which enhances the semantic modeling of cross-lingual models. Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results on various cross-lingual downstream tasks. The code and pre-trained models will be made publicly available.
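As a rough illustration of that insight, the Python sketch below generates pseudo-parallel pairs by back-translating monolingual sentences. This is a minimal sketch under stated assumptions, not ERNIE-M's actual training code: the translate() function is a hypothetical stand-in for a trained translation model (ERNIE-M produces its pseudo tokens inside pre-training itself), and the toy lexicon exists only so the example runs end to end.

def translate(sentence, src, tgt):
    """Hypothetical stand-in for a real translation model."""
    # A real pipeline would call a trained MT model here; a toy
    # word-for-word lexicon keeps this example self-contained.
    toy_lexicon = {("en", "de"): {"hello": "hallo", "world": "welt"}}
    table = toy_lexicon.get((src, tgt), {})
    return " ".join(table.get(tok, tok) for tok in sentence.lower().split())

def make_pseudo_parallel(monolingual_corpus, src="en", tgt="de"):
    """Pair each monolingual sentence with its back-translated counterpart.

    The resulting (source, pseudo-target) pairs can feed a cross-lingual
    alignment objective, so the amount of alignment signal is no longer
    bounded by the size of a genuine parallel corpus.
    """
    return [(s, translate(s, src, tgt)) for s in monolingual_corpus]

if __name__ == "__main__":
    corpus = ["hello world", "world hello"]
    for source, pseudo_target in make_pseudo_parallel(corpus):
        print(f"{source!r} -> {pseudo_target!r}")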


Related research

02/26/2022
Multi-Level Contrastive Learning for Cross-Lingual Alignment
Cross-language pre-trained models such as multilingual BERT (mBERT) have...

10/24/2019
Low-Resource Sequence Labeling via Unsupervised Multilingual Contextualized Representations
Previous work on cross-lingual sequence labeling tasks either requires p...

05/21/2022
Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese
Multilingual language models such as mBERT have seen impressive cross-li...

09/01/2021
Aligning Cross-lingual Sentence Representations with Dual Momentum Contrast
In this paper, we propose to align sentence representations from differe...

04/15/2021
Bilingual Terminology Extraction from Non-Parallel E-Commerce Corpora
Bilingual terminologies are important resources for natural language pro...

07/16/2019
Language comparison via network topology
Modeling relations between languages can offer understanding of language...

05/25/2022
Language Anisotropic Cross-Lingual Model Editing
Pre-trained language models learn large amounts of knowledge from their ...
