Parallel Corpus Filtering via Pre-trained Language Models

05/13/2020
by Boliang Zhang, et al.

Web-crawled data provides a good source of parallel corpora for training machine translation models. It is automatically obtained, but extremely noisy, and recent work shows that neural machine translation systems are more sensitive to noise than traditional statistical machine translation methods. In this paper, we propose a novel approach to filtering noisy sentence pairs out of web-crawled corpora via pre-trained language models. We measure sentence parallelism by leveraging the multilingual capability of BERT, and use the Generative Pre-training (GPT) language model as a domain filter to balance data domains. We evaluate the proposed method on the WMT 2018 Parallel Corpus Filtering shared task and on our own web-crawled Japanese-Chinese parallel corpus, which we make publicly available. Our method significantly outperforms baselines and achieves a new state-of-the-art. In an unsupervised setting, it achieves performance comparable to the top supervised method.
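The cascaded filtering described in the abstract can be sketched as a two-stage pipeline: each candidate pair gets a parallelism score (the paper derives this from multilingual BERT) and a domain/fluency score (the paper derives this from a GPT language model), and only pairs passing both filters are kept. The scorers below are toy heuristic stand-ins, not the paper's models, so the pipeline logic runs without any pre-trained checkpoints:

```python
def parallelism_score(src: str, tgt: str) -> float:
    """Stand-in for a BERT-based parallelism score in [0, 1].
    Here: a crude length-ratio heuristic (the paper uses
    multilingual BERT instead)."""
    if not src or not tgt:
        return 0.0
    return min(len(src), len(tgt)) / max(len(src), len(tgt))


def domain_score(tgt: str) -> float:
    """Stand-in for a GPT-based in-domain/fluency score in [0, 1].
    Here: fraction of alphabetic/whitespace characters (the paper
    uses GPT language-model scores instead)."""
    if not tgt:
        return 0.0
    good = sum(ch.isalpha() or ch.isspace() for ch in tgt)
    return good / len(tgt)


def filter_corpus(pairs, parallel_threshold=0.5, domain_threshold=0.5):
    """Keep only pairs that pass both the parallelism filter and
    the domain filter, mirroring the cascaded setup."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if parallelism_score(src, tgt) >= parallel_threshold
        and domain_score(tgt) >= domain_threshold
    ]


corpus = [
    ("the cat sits on the mat", "le chat est assis sur le tapis"),
    ("hello", "completely unrelated very long target sentence here"),
    ("buy now!!!", "$$$ 100% ###"),
]
# Only the first pair survives: the second fails the parallelism
# filter (length mismatch), the third fails the domain filter.
print(filter_corpus(corpus))
```

In the actual method the heuristic scorers would be replaced by model-based ones, but the thresholded cascade structure is the same: an adequacy filter over sentence pairs followed by a monolingual domain filter over the target side.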


Related research

- 11/25/2019 — JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus
- 05/22/2019 — Corpus Augmentation by Sentence Segmentation for Low-Resource Neural Machine Translation
- 09/11/2021 — Multilingual Translation via Grafting Pre-trained Language Models
- 03/11/2021 — Learning Feature Weights using Reward Modeling for Denoising Parallel Corpora
- 12/20/2022 — Perplexed by Quality: A Perplexity-based Method for Adult and Harmful Content Detection in Multilingual Heterogeneous Web Data
- 04/29/2020 — Bilingual Text Extraction as Reading Comprehension
- 12/19/2022 — Synthetic Pre-Training Tasks for Neural Machine Translation
