ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora

12/31/2020
by Xuan Ouyang, et al.

Recent studies have demonstrated that pre-trained cross-lingual models achieve impressive performance on downstream cross-lingual tasks. This improvement stems from training on large amounts of monolingual and parallel corpora. While it is generally acknowledged that parallel corpora are critical for improving model performance, existing methods are often constrained by the size of available parallel corpora, especially for low-resource languages. In this paper, we propose ERNIE-M, a new training method that encourages the model to align the representations of multiple languages using monolingual corpora, breaking the constraint that parallel corpus size places on model performance. Our key insight is to integrate the idea of back translation into the pre-training process. We generate pseudo-parallel sentence pairs from a monolingual corpus to enable the learning of semantic alignment between different languages, which enhances the semantic modeling of cross-lingual models. Experimental results show that ERNIE-M outperforms existing cross-lingual models and delivers new state-of-the-art results on various cross-lingual downstream tasks. The code and pre-trained models will be made publicly available.
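To make the core idea concrete, the sketch below illustrates one way pseudo-parallel pairs from back translation could feed a cross-lingual alignment objective. It is a minimal illustration only: the toy encoder, the `back_translate` placeholder, and the InfoNCE-style loss are assumptions for demonstration, not ERNIE-M's actual pre-training objectives or API.

```python
import torch
import torch.nn.functional as F

def encode(sentences, encoder):
    """Mean-pooled sentence embeddings from a toy encoder
    (a character-ID embedding here, purely for illustration)."""
    vecs = []
    for s in sentences:
        ids = torch.tensor([ord(c) % 256 for c in s], dtype=torch.long)
        vecs.append(encoder(ids).mean(dim=0))
    return torch.stack(vecs)

def back_translate(sentences):
    """Placeholder for a translation step that turns monolingual text into
    pseudo-parallel text in another language. A real setup would use an MT
    system or the pre-trained cross-lingual model itself."""
    return ["<pseudo-translation of: %s>" % s for s in sentences]

def alignment_loss(src_emb, tgt_emb, temperature=0.05):
    """InfoNCE-style loss that pulls each sentence toward its pseudo-translation
    and away from other sentences in the batch -- one common way to align
    cross-lingual representations, not necessarily the paper's exact loss."""
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.t() / temperature
    labels = torch.arange(len(src))
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    encoder = torch.nn.Embedding(256, 64)      # toy stand-in for the encoder
    mono_batch = ["the cat sat on the mat", "birds can fly"]
    pseudo_batch = back_translate(mono_batch)   # pseudo-parallel pairs
    loss = alignment_loss(encode(mono_batch, encoder),
                          encode(pseudo_batch, encoder))
    loss.backward()                             # gradients reach the encoder
    print(float(loss))
```

In this toy setup the monolingual batch supplies both sides of each training pair, which is the point of the back-translation trick: alignment supervision is obtained without any human-translated parallel data.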

