Dual Conditional Cross-Entropy Filtering of Noisy Parallel Corpora

09/01/2018
by Marcin Junczys-Dowmunt, et al.

In this work we introduce dual conditional cross-entropy filtering for noisy parallel data. For each sentence pair of the noisy parallel corpus we compute cross-entropy scores according to two inverse translation models trained on clean data. We penalize divergent cross-entropies and weigh the penalty by the cross-entropy average of both models. Sorting or thresholding according to these scores results in better subsets of parallel data. We achieve higher BLEU scores with models trained on parallel data filtered only from Paracrawl than with models trained on clean WMT data. We further evaluate our method in the context of the WMT2018 shared task on parallel corpus filtering and achieve the overall highest ranking scores of the shared task, scoring top in three out of four subtasks.
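
The scoring idea can be illustrated with a minimal sketch. It assumes the per-word cross-entropies of each sentence pair under the two translation models (source-to-target and target-to-source) have already been computed. It also assumes one plausible way of combining them: the absolute difference as the divergence penalty, added to the average of both values, and mapped into (0, 1] with a negative exponential. The function names, the toy values, and the exact combination are illustrative assumptions; the abstract itself does not spell out the formula.

```python
import math

def dual_xent_score(h_fwd, h_bwd):
    """Combine two cross-entropies into one score in (0, 1].

    h_fwd: per-word cross-entropy of the target given the source (assumed precomputed)
    h_bwd: per-word cross-entropy of the source given the target (assumed precomputed)
    """
    disagreement = abs(h_fwd - h_bwd)   # penalize divergent cross-entropies
    average = 0.5 * (h_fwd + h_bwd)     # penalize pairs both models find unlikely
    return math.exp(-(disagreement + average))

def rank_pairs(pairs):
    """Sort (label, h_fwd, h_bwd) tuples from best to worst score."""
    return sorted(pairs, key=lambda p: dual_xent_score(p[1], p[2]), reverse=True)

def threshold_pairs(pairs, min_score):
    """Keep only the tuples whose score reaches min_score."""
    return [p for p in pairs if dual_xent_score(p[1], p[2]) >= min_score]

# Hypothetical cross-entropy values, chosen only to illustrate the behaviour.
toy = [
    ("good pair", 1.0, 1.1),        # both models agree and are confident
    ("one-sided pair", 0.5, 4.0),   # models disagree strongly
    ("poor pair", 5.0, 5.2),        # models agree, but both find the pair unlikely
]

ranked = rank_pairs(toy)
kept = threshold_pairs(toy, min_score=0.1)
```

In this sketch a pair scores well only when both models assign it low cross-entropy and agree with each other, so either sorting (`rank_pairs`) or thresholding (`threshold_pairs`) yields a filtered subset.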

