TRIP: Triangular Document-level Pre-training for Multilingual Language Models

12/15/2022
by Hongyuan Lu, et al.

Despite the current success of multilingual pre-training, most prior works focus on leveraging monolingual data or bilingual parallel data and overlook the value of trilingual parallel data. This paper presents Triangular Document-level Pre-training (TRIP), the first method in the field to extend conventional monolingual and bilingual pre-training to a trilingual setting by (i) grafting the same document in two languages into one mixed document, and (ii) predicting the remaining language as the reference translation. Our experiments on document-level machine translation (MT) and cross-lingual abstractive summarization show that TRIP brings improvements of up to 3.65 d-BLEU points and 6.2 ROUGE-L points on three multilingual document-level MT benchmarks and one cross-lingual abstractive summarization benchmark, including multiple strong state-of-the-art (SOTA) scores. In-depth analysis indicates that TRIP improves document-level machine translation and captures document context better in at least three respects: (i) tense consistency, (ii) noun consistency and (iii) conjunction presence.
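To make the grafting objective concrete, here is a minimal sketch of how a TRIP-style training pair could be constructed. It is an illustration under stated assumptions, not the paper's implementation: documents are taken to be sentence-aligned across the three languages, simple sentence alternation stands in for the grafting strategy, and the function names (graft_documents, make_trip_example) are hypothetical.

```python
# A minimal sketch of TRIP-style trilingual pre-training data construction,
# assuming sentence-aligned versions of the same document in three languages.
# The alternating "grafting" strategy and all names here are illustrative
# assumptions, not the paper's code.

from typing import List, Tuple

def graft_documents(doc_lang_a: List[str], doc_lang_b: List[str]) -> str:
    """Graft the same document in two languages into one mixed document
    by alternating aligned sentences between the two languages."""
    grafted = [
        sent_a if i % 2 == 0 else sent_b
        for i, (sent_a, sent_b) in enumerate(zip(doc_lang_a, doc_lang_b))
    ]
    return " ".join(grafted)

def make_trip_example(
    doc_a: List[str], doc_b: List[str], doc_c: List[str]
) -> Tuple[str, str]:
    """Build one (source, target) pre-training pair: the grafted mixed
    document is the encoder input, and the document in the remaining
    third language is the reference translation the model must predict."""
    return graft_documents(doc_a, doc_b), " ".join(doc_c)

# Usage with a toy sentence-aligned trilingual document (EN/DE/FR):
en = ["The cat sat on the mat.", "It purred softly."]
de = ["Die Katze saß auf der Matte.", "Sie schnurrte leise."]
fr = ["Le chat était assis sur le tapis.", "Il ronronnait doucement."]
src, tgt = make_trip_example(en, de, fr)
print(src)  # "The cat sat on the mat. Sie schnurrte leise."
print(tgt)  # "Le chat était assis sur le tapis. Il ronronnait doucement."
```

One design note: alternating sentences keeps the mixed document the same length as either source, so the model sees full document context in two languages at once while still being supervised by a complete reference in the third.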

Related research

DOCmT5: Document-Level Pretraining of Multilingual Language Models (12/16/2021)
In this paper, we introduce DOCmT5, a multilingual sequence-to-sequence ...

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training (04/01/2021)
Vision-and-language pre-training has achieved impressive success in lear...

PARADISE: Exploiting Parallel Data for Multilingual Sequence-to-Sequence Pretraining (08/04/2021)
Despite the success of multilingual sequence-to-sequence pretraining, mo...

Cross-lingual Visual Pre-training for Multimodal Machine Translation (01/25/2021)
Pre-trained language models have been shown to improve performance in ma...

CDA: a Cost Efficient Content-based Multilingual Web Document Aligner (02/20/2021)
We introduce a Content-based Document Alignment approach (CDA), an effic...

Pre-training via Paraphrasing (06/26/2020)
We introduce MARGE, a pre-trained sequence-to-sequence model learned wit...

G-Transformer for Document-level Machine Translation (05/31/2021)
Document-level MT models are still far from satisfactory. Existing work ...
