DeepAI AI Chat
Log In Sign Up

Towards Semi-Supervised Learning of Automatic Post-Editing: Data-Synthesis by Infilling Mask with Erroneous Tokens

04/08/2022
by   WonKee Lee, et al.
0

Semi-supervised learning that leverages synthetic training data has been widely adopted in the field of Automatic post-editing (APE) to overcome the lack of human-annotated training data. In that context, data-synthesis methods to create high-quality synthetic data have also received much attention. Considering that APE takes machine-translation outputs containing translation errors as input, we propose a noising-based data-synthesis method that uses a mask language model to create noisy texts through substituting masked tokens with erroneous tokens, yet following the error-quantity statistics appearing in genuine APE data. In addition, we propose corpus interleaving, which is to combine two separate synthetic data by taking only advantageous samples, to further enhance the quality of the synthetic data created with our noising method. Experimental results reveal that using the synthetic data created with our approach results in significant improvements in APE performance upon using other synthetic data created with different existing data-synthesis methods.

READ FULL TEXT

page 1

page 2

page 3

page 4

06/10/2017

Online Learning for Neural Machine Translation Post-editing

Neural machine translation has meant a revolution of the field. Neverthe...
02/08/2021

Quality Estimation without Human-labeled Data

Quality estimation aims to measure the quality of translated content wit...
04/05/2020

AR: Auto-Repair the Synthetic Data for Neural Machine Translation

Compared with only using limited authentic parallel data as training cor...
02/18/2023

Scalable Prompt Generation for Semi-supervised Learning with Language Models

Prompt-based learning methods in semi-supervised learning (SSL) settings...
03/02/2020

Semi-supervised learning of glottal pulse positions in a neural analysis-synthesis framework

This article investigates into recently emerging approaches that use dee...
05/31/2022

Preparing an Endangered Language for the Digital Age: The Case of Judeo-Spanish

We develop machine translation and speech synthesis systems to complemen...
07/03/2019

Learning to Predict Robot Keypoints Using Artificially Generated Images

This work considers robot keypoint estimation on color images as a super...