DOCmT5: Document-Level Pretraining of Multilingual Language Models

12/16/2021
by Chia-Hsuan Lee, et al.

In this paper, we introduce DOCmT5, a multilingual sequence-to-sequence language model pre-trained with large-scale parallel documents. While previous approaches have focused on leveraging sentence-level parallel data, we aim to build a general-purpose pre-trained model that can understand and generate long documents. We propose a simple and effective pre-training objective, Document Reordering Machine Translation (DrMT), in which input documents are shuffled and masked and must then be translated. DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks, including over 12 BLEU points for seen-language-pair document-level MT, over 7 BLEU points for unseen-language-pair document-level MT, and over 3 ROUGE-1 points for seen-language-pair cross-lingual summarization. We achieve state-of-the-art (SOTA) results on the WMT20 De-En and IWSLT15 Zh-En document translation tasks. We also conduct extensive analysis of various factors in document pre-training, including (1) the effects of pre-training data quality and (2) the effects of combining monolingual and cross-lingual pre-training. We plan to make our model checkpoints publicly available.
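The abstract describes DrMT only at a high level: the input document is shuffled and masked, and the model must produce its translation. As an illustration, the minimal Python sketch below shows one way such a training pair could be constructed. The sentence-level shuffling granularity, the mask rate, and the T5-style sentinel tokens are assumptions made for illustration, not details confirmed by the paper.

```python
import random

# Hypothetical sketch of constructing one DrMT-style training example:
# shuffle and partially mask the source-language document, and use the
# clean, correctly ordered translation as the target.
# The mask rate and sentinel format are assumptions (T5-style), not the
# paper's actual configuration.

MASK_RATE = 0.15  # assumed fraction of source sentences replaced by sentinels


def make_drmt_example(src_sentences, tgt_sentences, seed=0):
    """Build one (input, target) pair for DrMT-style pre-training.

    src_sentences: sentences of the source-language document, in order.
    tgt_sentences: the parallel translation of the same document, in order.
    """
    rng = random.Random(seed)

    # 1) Shuffle the order of the source sentences.
    shuffled = src_sentences[:]
    rng.shuffle(shuffled)

    # 2) Mask a subset of the shuffled sentences with sentinel tokens.
    masked, sentinel_id = [], 0
    for sent in shuffled:
        if rng.random() < MASK_RATE:
            masked.append(f"<extra_id_{sentinel_id}>")
            sentinel_id += 1
        else:
            masked.append(sent)

    # 3) Input = corrupted source document; target = ordered translation.
    return " ".join(masked), " ".join(tgt_sentences)


if __name__ == "__main__":
    de = ["Satz eins.", "Satz zwei.", "Satz drei."]
    en = ["Sentence one.", "Sentence two.", "Sentence three."]
    x, y = make_drmt_example(de, en)
    print("input :", x)
    print("target:", y)
```

Under this reading, the model must jointly recover the original sentence order, fill in the masked content, and translate, which is what makes the objective document-level rather than sentence-level.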
