
Towards Semi-Supervised Learning of Automatic Post-Editing: Data-Synthesis by Infilling Mask with Erroneous Tokens

by WonKee Lee, et al.

Semi-supervised learning that leverages synthetic training data has been widely adopted in automatic post-editing (APE) to overcome the lack of human-annotated training data. Accordingly, data-synthesis methods for creating high-quality synthetic data have also received much attention. Considering that APE takes machine-translation outputs containing translation errors as input, we propose a noising-based data-synthesis method that uses a masked language model to create noisy texts by substituting masked tokens with erroneous tokens while following the error-quantity statistics observed in genuine APE data. In addition, we propose corpus interleaving, which combines two separate synthetic datasets by taking only the advantageous samples from each, to further enhance the quality of the synthetic data created with our noising method. Experimental results show that using the synthetic data created with our approach yields significant improvements in APE performance over using synthetic data created with existing data-synthesis methods.
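
The two ideas in the abstract, noising by mask infilling and corpus interleaving, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how erroneous tokens are selected or what makes a sample "advantageous", so `toy_infill` (a random substitute drawn from a small vocabulary, standing in for a masked language model), the `error_rates` list (standing in for error-quantity statistics measured on genuine APE data), and the `score` function used for interleaving are all hypothetical.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
# Hypothetical infill signature: (masked tokens, position, original token) -> new token
Infill = Callable[[List[str], int, str], str]


def sample_error_count(length: int, error_rates: Sequence[float],
                       rng: random.Random) -> int:
    """Draw how many tokens to corrupt from an empirical list of error rates."""
    rate = rng.choice(list(error_rates))
    return min(length, max(0, round(rate * length)))


def toy_infill(vocab: Sequence[str], rng: random.Random) -> Infill:
    """Stand-in for a masked language model (assumption): returns a random
    vocabulary token that differs from the original, ignoring the context."""
    def infill(masked: List[str], pos: int, original: str) -> str:
        candidates = [w for w in vocab if w != original]
        return rng.choice(candidates)
    return infill


def noise_sentence(tokens: Sequence[str], error_rates: Sequence[float],
                   infill: Infill, rng: random.Random) -> List[str]:
    """Mask a sampled number of positions and fill each with an erroneous token."""
    n_errors = sample_error_count(len(tokens), error_rates, rng)
    positions = rng.sample(range(len(tokens)), n_errors)
    noisy = list(tokens)
    for pos in positions:
        masked = noisy[:pos] + ["[MASK]"] + noisy[pos + 1:]
        noisy[pos] = infill(masked, pos, tokens[pos])
    return noisy


def interleave(corpus_a: Sequence[T], corpus_b: Sequence[T],
               score: Callable[[T], float]) -> List[T]:
    """For each aligned pair of samples, keep the one the (hypothetical)
    scoring function prefers; ties go to corpus_a."""
    return [a if score(a) >= score(b) else b
            for a, b in zip(corpus_a, corpus_b)]
```

In a realistic setting, `infill` would wrap a masked language model and pick a plausible but incorrect prediction for the masked position, and `score` would encode whatever criterion identifies the more useful sample; both choices are left open here because the abstract does not describe them.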



