
Synthetic Pre-Training Tasks for Neural Machine Translation

by Zexue He et al.
University of California, San Diego

Pre-training is an effective technique for ensuring robust performance on a variety of machine learning tasks. It typically depends on large-scale crawled corpora that can result in toxic or biased models. Such data can also be problematic with respect to copyright, attribution, and privacy. Pre-training with synthetic tasks and data is a promising way of alleviating such concerns since no real-world information is ingested by the model. Our goal in this paper is to understand what makes for a good pre-trained model when using synthetic resources. We answer this question in the context of neural machine translation by considering two novel approaches to translation model pre-training. Our first approach studies the effect of pre-training on obfuscated data derived from a parallel corpus by mapping words to a vocabulary of 'nonsense' tokens. Our second approach explores the effect of pre-training on procedurally generated synthetic parallel data that does not depend on any real human language corpus. Our empirical evaluation on multiple language pairs shows that, to a surprising degree, the benefits of pre-training can be realized even with obfuscated or purely synthetic parallel data. In our analysis, we consider the extent to which obfuscated and synthetic pre-training techniques can be used to mitigate the issue of hallucinated model toxicity.
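The two pre-training approaches described above can be illustrated with a minimal sketch. The code below assumes a deterministic hash-based word-to-token mapping for obfuscation and a toy token-wise transformation (shift plus reversal) for the procedurally generated parallel data; the paper's actual obfuscation scheme and generation procedure may differ.

```python
import hashlib
import random

VOCAB_SIZE = 30000  # size of the 'nonsense' token vocabulary (assumed)

def nonsense_token(word):
    """Deterministically map a real word to an opaque nonsense token,
    so repeated words always receive the same token (hypothetical scheme)."""
    h = int(hashlib.sha1(word.encode("utf-8")).hexdigest(), 16)
    return f"tok{h % VOCAB_SIZE}"

def obfuscate(sentence):
    """Replace every word with its nonsense token, preserving sentence
    structure while removing all real-world lexical content."""
    return " ".join(nonsense_token(w) for w in sentence.split())

def synthetic_pair(length, seed):
    """Procedurally generate an aligned source/target pair that depends on
    no human-language corpus. The 'translation' here is a fixed token-wise
    shift plus reversal -- an illustrative transformation only."""
    rng = random.Random(seed)
    src = [f"tok{rng.randrange(VOCAB_SIZE)}" for _ in range(length)]
    tgt = [f"tok{(int(t[3:]) + 7) % VOCAB_SIZE}" for t in reversed(src)]
    return " ".join(src), " ".join(tgt)

# Obfuscated pre-training data: structure survives, words do not.
print(obfuscate("the cat sat on the mat"))
# Purely synthetic pre-training data: no real text is ingested at all.
src, tgt = synthetic_pair(6, seed=0)
print(src, "->", tgt)
```

In both cases the pre-training signal is the structural regularity of the data (token co-occurrence, alignment, order) rather than any real-world semantics, which is what makes these settings useful for probing what pre-training actually transfers.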



