Exploring Paracrawl for Document-level Neural Machine Translation

04/20/2023
by Yusser Al Ghussin, et al.

Document-level neural machine translation (NMT) has outperformed sentence-level NMT on a number of datasets. However, document-level NMT is still not widely adopted in real-world translation systems mainly due to the lack of large-scale general-domain training data for document-level NMT. We examine the effectiveness of using Paracrawl for learning document-level translation. Paracrawl is a large-scale parallel corpus crawled from the Internet and contains data from various domains. The official Paracrawl corpus was released as parallel sentences (extracted from parallel webpages) and therefore previous works only used Paracrawl for learning sentence-level translation. In this work, we extract parallel paragraphs from Paracrawl parallel webpages using automatic sentence alignments and we use the extracted parallel paragraphs as parallel documents for training document-level translation models. We show that document-level NMT models trained with only parallel paragraphs from Paracrawl can be used to translate real documents from TED, News and Europarl, outperforming sentence-level NMT models. We also perform a targeted pronoun evaluation and show that document-level models trained with Paracrawl data can help context-aware pronoun translation.
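The abstract does not spell out how parallel paragraphs are derived from sentence alignments, so the following is only a minimal sketch of one plausible approach, not the authors' procedure. It assumes a source webpage already split into paragraphs of sentences, a target webpage given as a flat list of sentences, and a one-to-one sentence alignment. The function name `extract_parallel_paragraphs`, the `alignment` mapping, and the filtering rules (every source sentence aligned, target sentences consecutive, a minimum paragraph length) are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple


def extract_parallel_paragraphs(
    src_paragraphs: List[List[str]],
    tgt_sentences: List[str],
    alignment: Dict[Tuple[int, int], int],
    min_sentences: int = 2,
) -> List[Tuple[str, str]]:
    """Group aligned sentences into paragraph-level pairs (illustrative sketch).

    src_paragraphs: source page as paragraphs, each a list of sentences.
    tgt_sentences:  target page as a flat list of sentences.
    alignment:      hypothetical mapping from (paragraph index, sentence index)
                    on the source side to a target sentence index.
    """
    pairs = []
    for p_idx, sentences in enumerate(src_paragraphs):
        # Assumption: single-sentence paragraphs carry no document context.
        if len(sentences) < min_sentences:
            continue
        tgt_ids = [alignment.get((p_idx, s_idx)) for s_idx in range(len(sentences))]
        # Assumption: drop paragraphs with any unaligned source sentence.
        if any(t is None for t in tgt_ids):
            continue
        # Assumption: target sentences must be consecutive and in the same order.
        if any(b - a != 1 for a, b in zip(tgt_ids, tgt_ids[1:])):
            continue
        src_text = " ".join(sentences)
        tgt_text = " ".join(tgt_sentences[t] for t in tgt_ids)
        pairs.append((src_text, tgt_text))
    return pairs


# Toy example: only the first paragraph passes all three filters.
example_src = [["Hello.", "How are you?"], ["Bye."], ["One.", "Two."]]
example_tgt = ["Hallo.", "Wie geht es dir?", "Auf Wiedersehen.", "Eins."]
example_alignment = {(0, 0): 0, (0, 1): 1, (1, 0): 2, (2, 0): 3}
example_pairs = extract_parallel_paragraphs(example_src, example_tgt, example_alignment)
```

In this toy input the second paragraph is skipped for being a single sentence and the third because its second sentence has no alignment. A real pipeline would likely need looser rules, for example tolerating one-to-many alignments.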


