Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings

10/15/2020
by Phillip Keung, et al.

We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training. We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods. We then improve an XLM-based unsupervised neural MT system pre-trained on Wikipedia by supplementing it with pseudo-parallel text mined from the same corpus, boosting unsupervised translation performance by up to 3.5 BLEU on the WMT'14 French-English and WMT'16 German-English tasks and outperforming the previous state-of-the-art. Finally, we enrich the IWSLT'15 English-Vietnamese corpus with pseudo-parallel Wikipedia sentence pairs, yielding a 1.2 BLEU improvement on the low-resource MT task. We demonstrate that unsupervised bitext mining is an effective way of augmenting MT datasets and complements existing techniques like initializing with pre-trained contextual embeddings.
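The nearest-neighbor mining step can be illustrated with a minimal sketch. It assumes sentence embeddings have already been computed (for example, pooled multilingual BERT outputs) and are given as plain lists of floats. The function names `cosine` and `mine_pairs`, the toy vectors, and the fixed similarity threshold are all hypothetical choices made for illustration; the abstract does not specify the scoring function or the filtering rule, and the self-training step is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def mine_pairs(src_vecs, tgt_vecs, threshold=0.9):
    """For each source embedding, find its nearest target embedding by
    cosine similarity and keep the pair if the similarity reaches
    `threshold` (an assumed filtering rule, not taken from the paper)."""
    pairs = []
    for i, s in enumerate(src_vecs):
        best_j, best_sim = -1, -1.0
        for j, t in enumerate(tgt_vecs):
            sim = cosine(s, t)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j >= 0 and best_sim >= threshold:
            pairs.append((i, best_j, best_sim))
    return pairs


# Toy embeddings standing in for real sentence vectors (illustrative only).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]

mined = mine_pairs(src, tgt, threshold=0.9)
for i, j, sim in mined:
    print(f"source {i} <-> target {j} (cosine {sim:.3f})")
```

In this toy run the third source vector has no sufficiently close target and is discarded, which is the role the threshold plays in keeping a mined pseudo-parallel corpus clean. At the scale of Wikipedia, the exhaustive inner loop would normally be replaced by an approximate nearest-neighbor index.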
