NAIST-SIC-Aligned: Automatically-Aligned English-Japanese Simultaneous Interpretation Corpus

04/23/2023
by Jinming Zhao, et al.

How simultaneous interpretation (SI) data affects simultaneous machine translation (SiMT) remains an open question. Research has been limited by the lack of a large-scale training corpus. In this work, we aim to fill this gap by introducing NAIST-SIC-Aligned, an automatically-aligned parallel English-Japanese SI dataset. Starting from the non-aligned corpus NAIST-SIC, we propose a two-stage alignment approach that makes the corpus parallel and thus suitable for model training. The first stage is coarse alignment, in which we perform a many-to-many mapping between source and target sentences. The second stage is fine-grained alignment, in which we perform intra- and inter-sentence filtering to improve the quality of the aligned pairs. To ensure the quality of the corpus, each step has been validated either quantitatively or qualitatively. This is the first open-source large-scale parallel SI dataset in the literature. We also manually curated a small test set for evaluation. We hope our work advances research on SI corpus construction and SiMT. Please find our data at <https://github.com/mingzi151/AHC-SI>.
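
To make the two-stage idea concrete, the following is a minimal Python sketch and not the procedure used to build the corpus. It assumes a monotonic dynamic-programming search for the coarse many-to-many mapping and reduces the fine-grained stage to a single score threshold over pairs; intra-sentence filtering is not modelled. The names `coarse_align`, `filter_pairs`, and `toy_overlap`, the `max_group` limit, and the threshold value are all hypothetical, and the token-overlap scorer only stands in for a real cross-lingual similarity model.

```python
from itertools import product


def toy_overlap(a, b):
    """Hypothetical stand-in scorer: Jaccard overlap of whitespace tokens."""
    x, y = set(a.split()), set(b.split())
    return len(x & y) / len(x | y) if x | y else 0.0


def coarse_align(src, tgt, score, max_group=2):
    """Monotonic many-to-many alignment by dynamic programming.

    Groups of 1..max_group consecutive source sentences are mapped to
    groups of 1..max_group consecutive target sentences so that the sum
    of group scores is maximal. Returns a list of (src_group, tgt_group).
    """
    n, m = len(src), len(tgt)
    neg = float("-inf")
    best = [[neg] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == neg:
                continue  # prefix (i, j) cannot be reached
            for di, dj in product(range(1, max_group + 1), repeat=2):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # Groups are joined with spaces for scoring (an assumption
                # that suits the toy scorer, not Japanese text).
                cand = best[i][j] + score(" ".join(src[i:ni]),
                                          " ".join(tgt[j:nj]))
                if cand > best[ni][nj]:
                    best[ni][nj] = cand
                    back[ni][nj] = (i, j)
    if best[n][m] == neg:
        raise ValueError("no alignment exists for this max_group")
    # Walk the back-pointers from the end to recover the groups.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]


def filter_pairs(pairs, score, threshold=0.5):
    """Keep only aligned pairs whose score reaches the threshold."""
    return [(s, t) for s, t in pairs
            if score(" ".join(s), " ".join(t)) >= threshold]


source = ["a b", "c d", "e f"]
target = ["a b c d", "x y"]
aligned = coarse_align(source, target, toy_overlap)
kept = filter_pairs(aligned, toy_overlap)
```

On this toy input the first two source sentences are grouped against the first target sentence, and the filter then discards the remaining low-overlap pair. Summing group scores favours many small groups, so a real system would likely need a normalised or learned score.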


research
05/30/2023

DEPLAIN: A German Parallel Corpus with Intralingual Translations into Plain Language for Sentence and Document Simplification

Text simplification is an intralingual translation task in which documen...
research
05/06/2019

A Large Parallel Corpus of Full-Text Scientific Articles

The Scielo database is an important source of scientific information in ...
research
05/05/2019

BVS Corpus: A Multilingual Parallel Corpus of Biomedical Scientific Texts

The BVS database (Health Virtual Library) is a centralized source of bio...
research
08/28/2022

CJaFr-v3 : A Freely Available Filtered Japanese-French Aligned Corpus

We present a free Japanese-French parallel corpus. It includes 15M align...
research
07/01/2020

Iterative Paraphrastic Augmentation with Discriminative Span Alignment

We introduce a novel paraphrastic augmentation strategy based on sentenc...
research
03/04/2022

EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation

Complete Multi-lingual Neural Machine Translation (C-MNMT) achieves supe...
research
10/27/2020

Volctrans Parallel Corpus Filtering System for WMT 2020

In this paper, we describe our submissions to the WMT20 shared task on p...
