Original or Translated? On the Use of Parallel Data for Translation Quality Estimation

12/20/2022
by Baopu Qiu, et al.

Machine Translation Quality Estimation (QE) is the task of evaluating translation output in the absence of human-written references. Because human-labeled QE data is scarce, previous work has used abundant unlabeled parallel corpora to produce additional training data with pseudo labels. In this paper, we demonstrate a significant gap between parallel data and real QE data: in QE data, the source side is strictly guaranteed to be original text and the target side to be translated text (namely translationese). Parallel data, however, makes no such distinction, and translationese may occur on either the source or the target side. We compare the impact of parallel data with different translation directions in QE data augmentation, and find that using the source-original part of the parallel corpus consistently outperforms its target-original counterpart. Moreover, since the WMT corpus lacks direction information for each parallel sentence, we train a classifier to distinguish source- and target-original bitext, and analyze how the two differ in both style and domain. Together, these findings suggest using source-original parallel data for QE data augmentation, which brings a relative improvement of up to 4.0 on sentence- and word-level QE tasks respectively.
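
The selection step described in the abstract (keeping only the source-original part of a parallel corpus before generating pseudo-labeled QE data) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_source_original`, the 0.5 threshold, and the stub scorer with its made-up probabilities are all assumptions for illustration. A real pipeline would plug in a trained direction classifier in place of the stub.

```python
from typing import Callable, Iterable, List, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def select_source_original(
    pairs: Iterable[SentencePair],
    p_source_original: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[SentencePair]:
    """Keep the pairs that a direction classifier scores as source-original.

    `p_source_original` is a hypothetical callable returning the estimated
    probability that the source side is original text and the target side
    is its translation.
    """
    return [
        (src, tgt)
        for src, tgt in pairs
        if p_source_original(src, tgt) >= threshold
    ]


# Toy corpus with invented classifier scores, standing in for real predictions.
corpus: List[SentencePair] = [
    ("Guten Morgen.", "Good morning."),
    ("Vielen Dank.", "Thank you very much."),
]
stub_scores = {corpus[0]: 0.92, corpus[1]: 0.18}


def stub_scorer(src: str, tgt: str) -> float:
    # Placeholder for a trained classifier: looks up a precomputed score.
    return stub_scores[(src, tgt)]


selected = select_source_original(corpus, stub_scorer)
```

With the invented scores above, only the first pair passes the threshold and would be passed on to pseudo-label generation; the second is treated as target-original and dropped.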


research
05/30/2020

Data Augmentation for Learning Bilingual Word Embeddings with Unsupervised Machine Translation

Unsupervised bilingual word embedding (BWE) methods learn a linear trans...
research
05/20/2018

The UN Parallel Corpus Annotated for Translation Direction

This work distinguishes between translated and original text in the UN p...
research
04/01/2022

CipherDAug: Ciphertext based Data Augmentation for Neural Machine Translation

We propose a novel data-augmentation technique for neural machine transl...
research
05/18/2022

PreQuEL: Quality Estimation of Machine Translation Outputs in Advance

We present the task of PreQuEL, Pre-(Quality-Estimation) Learning. A Pre...
research
09/11/2016

Unsupervised Identification of Translationese

Translated texts are distinctively different from original ones, to the ...
research
10/21/2020

Improving Simultaneous Translation with Pseudo References

Simultaneous translation is vastly different from full-sentence translat...
research
05/12/2022

AppTek's Submission to the IWSLT 2022 Isometric Spoken Language Translation Task

To participate in the Isometric Spoken Language Translation Task of the ...
