Filtering and Mining Parallel Data in a Joint Multilingual Space

05/24/2018
by   Holger Schwenk, et al.
0

We learn a joint multilingual sentence embedding and use the distance between sentences in different languages to filter noisy parallel data and to mine for parallel data in large news collections. We are able to improve a competitive baseline on the WMT'14 English to German task by 0.3 BLEU by filtering out 25 of the training data. The same approach is used to mine additional bitexts for the WMT'14 system and to obtain competitive results on the BUCC shared task to identify parallel sentences in comparable corpora. The approach is generic, it can be applied to many language pairs and it is independent of the architecture of the machine translation system.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/03/2018

Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings

Machine translation is highly sensitive to the size and quality of the t...
research
10/19/2022

Separating Grains from the Chaff: Using Data Filtering to Improve Multilingual Translation for Low-Resourced African Languages

We participated in the WMT 2022 Large-Scale Machine Translation Evaluati...
research
06/13/2018

Extracting Parallel Sentences with Bidirectional Recurrent Neural Networks to Improve Machine Translation

Parallel sentence extraction is a task addressing the data sparsity prob...
research
12/16/2021

Idiomatic Expression Paraphrasing without Strong Supervision

Idiomatic expressions (IEs) play an essential role in natural language. ...
research
11/10/2019

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB

We show that margin-based bitext mining in a multilingual sentence space...
research
09/29/2015

Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs

Parallel sentences are a relatively scarce but extremely useful resource...
research
06/20/2019

Low-Resource Corpus Filtering using Multilingual Sentence Embeddings

In this paper, we describe our submission to the WMT19 low-resource para...

Please sign up or login with your details

Forgot password? Click here to reset