DOCmT5: Document-Level Pretraining of Multilingual Language Models

by Chia-Hsuan Lee, et al.

In this paper, we introduce DOCmT5, a multilingual sequence-to-sequence language model pre-trained on large-scale parallel documents. While previous approaches have focused on leveraging sentence-level parallel data, we aim to build a general-purpose pre-trained model that can understand and generate long documents. We propose a simple and effective pre-training objective, Document Reordering Machine Translation (DrMT), in which shuffled and masked input documents must be translated. DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks: over 12 BLEU points for seen-language-pair document-level MT, over 7 BLEU points for unseen-language-pair document-level MT, and over 3 ROUGE-1 points for seen-language-pair cross-lingual summarization. We achieve state-of-the-art (SOTA) results on the WMT20 De-En and IWSLT15 Zh-En document translation tasks. We also conduct extensive analysis of various factors in document pre-training, including (1) the effects of pre-training data quality and (2) the effects of combining monolingual and cross-lingual pre-training. We plan to make our model checkpoints publicly available.
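The DrMT objective described above combines two kinds of input corruption (sentence reordering and token masking) with a translation target. The following minimal Python sketch illustrates how one such training pair could be constructed; the sentence-level shuffling, the exact masking rate, and the T5-style `<extra_id_N>` sentinel tokens are assumptions for illustration, not details taken from the paper.

```python
import random


def make_drmt_example(src_sentences, tgt_document, mask_rate=0.15, seed=0):
    """Build one DrMT-style training pair (illustrative sketch).

    The source document is corrupted by shuffling its sentences and
    masking a fraction of tokens; the target is the parallel document
    in the other language, which the model must produce.
    """
    rng = random.Random(seed)

    # Document reordering noise: permute the sentence order.
    sentences = src_sentences[:]
    rng.shuffle(sentences)

    # Token masking noise: replace tokens with sentinel tokens
    # (T5-style sentinels are an assumption here).
    corrupted, sentinel_id = [], 0
    for sent in sentences:
        tokens = []
        for tok in sent.split():
            if rng.random() < mask_rate:
                tokens.append(f"<extra_id_{sentinel_id}>")
                sentinel_id += 1
            else:
                tokens.append(tok)
        corrupted.append(" ".join(tokens))

    model_input = " ".join(corrupted)
    return model_input, tgt_document
```

The model is then trained with a standard sequence-to-sequence cross-entropy loss on `(model_input, tgt_document)` pairs, so a single objective forces it to denoise, reorder, and translate at the document level.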




