Synthetic Pre-Training Tasks for Neural Machine Translation

by Zexue He, et al.

Pre-training is an effective technique for ensuring robust performance on a variety of machine learning tasks. It typically depends on large-scale crawled corpora that can result in toxic or biased models. Such data can also be problematic with respect to copyright, attribution, and privacy. Pre-training with synthetic tasks and data is a promising way of alleviating such concerns since no real-world information is ingested by the model. Our goal in this paper is to understand what makes for a good pre-trained model when using synthetic resources. We answer this question in the context of neural machine translation by considering two novel approaches to translation model pre-training. Our first approach studies the effect of pre-training on obfuscated data derived from a parallel corpus by mapping words to a vocabulary of 'nonsense' tokens. Our second approach explores the effect of pre-training on procedurally generated synthetic parallel data that does not depend on any real human language corpus. Our empirical evaluation on multiple language pairs shows that, to a surprising degree, the benefits of pre-training can be realized even with obfuscated or purely synthetic parallel data. In our analysis, we consider the extent to which obfuscated and synthetic pre-training techniques can be used to mitigate the issue of hallucinated model toxicity.
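To make the first approach concrete, here is a minimal sketch of word-level obfuscation of a parallel pair. This is an illustrative assumption, not the paper's exact procedure: the token format `<tok_i>`, the helper names, and the use of separate source/target maps are all hypothetical. The idea it demonstrates is that each real word is deterministically replaced by a meaningless token, so the corpus keeps its alignment structure while carrying no real-language content.

```python
import random

def build_obfuscation_map(vocab, seed=0):
    """Map each real word to a unique 'nonsense' token such as <tok_17>.

    (Hypothetical helper; the paper's actual mapping scheme may differ.)
    """
    rng = random.Random(seed)
    ids = list(range(len(vocab)))
    rng.shuffle(ids)
    return {word: f"<tok_{i}>" for word, i in zip(vocab, ids)}

def obfuscate(sentence, mapping):
    """Replace every known word with its nonsense token."""
    return " ".join(mapping.get(w, "<unk>") for w in sentence.split())

# Toy parallel pair. Source and target sides get independent maps, so no
# real-world lexical information survives -- only sentence structure and
# cross-lingual alignment, which is what pre-training can still exploit.
src = "the cat sat"
tgt = "le chat est assis"
src_map = build_obfuscation_map(sorted(set(src.split())), seed=1)
tgt_map = build_obfuscation_map(sorted(set(tgt.split())), seed=2)
print(obfuscate(src, src_map), "|||", obfuscate(tgt, tgt_map))
```

A fixed seed keeps the mapping deterministic across the corpus, so every occurrence of a word maps to the same nonsense token, preserving co-occurrence statistics for the model to learn from.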




Related research:

- Pivot-based Transfer Learning for Neural Machine Translation between Non-English Languages
- JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation
- Parallel Corpus Filtering via Pre-trained Language Models
- On the Copying Behaviors of Pre-Training for Neural Machine Translation
- JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus
- Improved Data Augmentation for Translation Suggestion
- Language Models not just for Pre-training: Fast Online Neural Noisy Channel Modeling
