QUAK: A Synthetic Quality Estimation Dataset for Korean-English Neural Machine Translation

09/30/2022
by Sugyeong Eo, et al.

As recent advances in neural machine translation have underscored its importance, research on quality estimation (QE) has progressed steadily. QE aims to automatically predict the quality of machine translation (MT) output without reference sentences. Despite its practical utility, manual QE data creation has several limitations: it incurs non-trivial costs because it requires translation experts, and it is difficult to scale to more data and to new languages. To address these limitations, we present QUAK, a Korean-English synthetic QE dataset generated fully automatically. It consists of three sub-datasets, QUAK-M, QUAK-P, and QUAK-H, produced through three strategies that are relatively free from language constraints. Because none of the strategies requires human effort, the data scales easily: we scale up to 1.58M for QUAK-P and QUAK-H and 6.58M for QUAK-M. In our experiments, we quantitatively analyze word-level QE results from several angles and perform statistical analysis. We also show that efficiently scaled data contributes to performance: QUAK-M and QUAK-P yield meaningful gains as data is added up to 1.58M.
