A High-Quality Multilingual Dataset for Structured Documentation Translation

06/24/2020
by Kazuma Hashimoto, et al.

This paper presents a high-quality multilingual dataset for the documentation domain to advance research on localization of structured text. Unlike widely used datasets for translation of plain text, we collect XML-structured parallel text segments from the online documentation for an enterprise software platform. These web pages have been professionally translated from English into 16 languages and maintained by domain experts, and around 100,000 text segments are available for each language pair. We build and evaluate translation models for seven target languages from English, with several different copy mechanisms and an XML-constrained beam search. We also experiment with a non-English pair to show that our dataset has the potential to explicitly enable 17 × 16 translation settings. Our experiments show that learning to translate with the XML tags improves translation accuracy, and the beam search accurately generates XML structures. We also discuss trade-offs of using the copy mechanisms by focusing on translation of numerical words and named entities. We further provide a detailed human analysis of gaps between the model output and human translations for real-world applications, including suitability for post-editing.
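
To illustrate the general idea behind constraining decoding so that generated XML stays well-formed, here is a minimal Python sketch. It is not the paper's implementation: the function names (`advance`, `is_valid_next`, `filter_candidates`), the `[EOS]` sentinel, and the assumption that each tag is a single attribute-free token are all hypothetical choices made for this example. The sketch keeps a stack of open tags and discards candidate tokens that would close the wrong tag or end the sequence while tags remain open.

```python
import re

# Assumption: each XML tag is one token with no attributes, e.g. "<b>" or "</b>".
TAG_RE = re.compile(r"^<(/?)([A-Za-z][\w.-]*)>$")
EOS = "[EOS]"  # hypothetical end-of-sequence sentinel


def advance(open_stack, token):
    """Return the stack of open tag names after consuming `token`."""
    m = TAG_RE.match(token)
    if m is None:
        return open_stack              # plain text leaves the stack unchanged
    if m.group(1) == "/":
        return open_stack[:-1]         # closing tag pops the innermost tag
    return open_stack + [m.group(2)]   # opening tag pushes its name


def is_valid_next(open_stack, token):
    """True if appending `token` keeps the partial output well-formed."""
    if token == EOS:
        return not open_stack          # may only stop once every tag is closed
    m = TAG_RE.match(token)
    if m is None or m.group(1) != "/":
        return True                    # text and opening tags are always allowed
    # A closing tag must match the most recently opened tag.
    return bool(open_stack) and open_stack[-1] == m.group(2)


def filter_candidates(prefix, candidates):
    """Keep only the candidate tokens that are valid continuations of `prefix`."""
    stack = []
    for tok in prefix:
        stack = advance(stack, tok)
    return [c for c in candidates if is_valid_next(stack, c)]


if __name__ == "__main__":
    print(filter_candidates(["Click", "<b>", "Save"], ["</b>", "</i>", "now", EOS]))
```

In a beam search, a filter of this kind would be applied to each hypothesis's candidate expansions before scoring, so that hypotheses with mismatched or unclosed tags never enter the beam. A real system would presumably also restrict the set of tags to those that appear in the source segment, which this sketch does not attempt.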
