Unsupervised Parallel Corpus Mining on Web Data

09/18/2020
by Guokun Lai, et al.

Given a large amount of parallel data, neural machine translation systems can deliver human-level performance on sentence-level translation. However, having humans label parallel data at that scale is costly. In contrast, the Internet already contains a large volume of parallel text created by humans. The major difficulty in using it is separating it from the noisy web environment in which it appears. Current parallel data mining methods all require labeled parallel data as the training source. In this paper, we present a pipeline that mines parallel corpora from the Internet in an unsupervised manner. On the widely used WMT'14 English-French and WMT'16 English-German benchmarks, a machine translation model trained on the data extracted by our pipeline performs very close to supervised results. On the WMT'16 English-Romanian and Romanian-English benchmarks, our system produces new state-of-the-art results of 39.81 and 38.95 BLEU, even when compared with supervised approaches.

research
02/07/2020

A Multilingual View of Unsupervised Machine Translation

We present a probabilistic framework for multilingual neural machine tra...
research
10/15/2020

Unsupervised Bitext Mining and Translation via Self-trained Contextual Embeddings

We describe an unsupervised method to create pseudo-parallel corpora for...
research
07/07/2020

scb-mt-en-th-2020: A Large English-Thai Parallel Corpus

The primary objective of our work is to build a large-scale English-Thai...
research
08/11/2021

Icelandic Parallel Abstracts Corpus

We present a new Icelandic-English parallel corpus, the Icelandic Parall...
research
05/18/2020

NEJM-enzh: A Parallel Corpus for English-Chinese Translation in the Biomedical Domain

Machine translation requires large amounts of parallel text. While such ...
research
07/28/2020

Preparation of Sentiment tagged Parallel Corpus and Testing its effect on Machine Translation

In the current work, we explore the enrichment in the machine translatio...
research
08/03/2017

The UMD Neural Machine Translation Systems at WMT17 Bandit Learning Task

We describe the University of Maryland machine translation systems submi...
