Data Selection with Cluster-Based Language Difference Models and Cynical Selection

04/09/2019
by Lucía Santamaría, et al.

We present and apply two methods for selecting relevant training data from a general pool for use in tasks such as machine translation. Building on existing work on class-based language difference models, we first introduce a cluster-based method that uses Brown clusters to condense the vocabulary of the corpora. Second, we implement the cynical data selection method, which incrementally constructs a training corpus to efficiently model the task corpus. Both the cluster-based and the cynical data selection approaches are used for the first time within a machine translation system, and we perform a head-to-head comparison. Our intrinsic evaluations show that both new methods outperform the standard Moore-Lewis approach (cross-entropy difference), yielding lower perplexity and out-of-vocabulary (OOV) rates on in-domain data. The cynical approach converges much more quickly, covering nearly all of the in-domain vocabulary with 84% [...]. Furthermore, the new approaches can be used to select machine translation training data for training better systems. Our results confirm that class-based selection using Brown clusters is a viable alternative to POS-based class-based methods and removes the reliance on a part-of-speech tagger. Additionally, we validate the recently proposed cynical data selection method, showing that its performance in SMT models surpasses that of traditional cross-entropy difference methods and that the data it selects more closely matches the sentence length of the task corpus.
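To make the baseline concrete, the following is a minimal sketch of cross-entropy difference (Moore-Lewis) scoring, with an optional word-to-class mapping that condenses the vocabulary in the spirit of the cluster-based variant. It is an illustration under stated assumptions, not the authors' implementation: it substitutes an add-one smoothed unigram model for a real n-gram language model, and the names `train_unigram`, `cross_entropy`, `rank_by_ce_difference`, and `word_to_class` are hypothetical. In practice the class mapping would come from a Brown clustering tool; here it is simply a dictionary supplied by the caller.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Add-one smoothed unigram model (toy stand-in for a real n-gram LM)."""
    counts = Counter(tok for sent in sentences for tok in sent if tok in vocab)
    # One extra unit of probability mass is reserved for unseen tokens.
    total = sum(counts.values()) + len(vocab) + 1
    probs = {w: (counts[w] + 1) / total for w in vocab}
    unk_prob = 1 / total
    return lambda tok: probs.get(tok, unk_prob)


def cross_entropy(sentence, model):
    """Per-token cross-entropy (bits) of a sentence under a model."""
    return -sum(math.log2(model(tok)) for tok in sentence) / max(len(sentence), 1)


def rank_by_ce_difference(pool, task, general_sample, word_to_class=None):
    """Rank pool sentences by H_task(s) - H_general(s); lower means more task-like.

    word_to_class is an assumed, caller-supplied mapping (e.g. word -> cluster id).
    Words missing from the mapping are kept as they are.
    """
    def condense(sentence):
        if word_to_class is None:
            return sentence
        return [word_to_class.get(tok, tok) for tok in sentence]

    task_c = [condense(s) for s in task]
    general_c = [condense(s) for s in general_sample]
    vocab = {tok for s in task_c + general_c for tok in s}

    task_lm = train_unigram(task_c, vocab)
    general_lm = train_unigram(general_c, vocab)

    def score(sentence):
        c = condense(sentence)
        return cross_entropy(c, task_lm) - cross_entropy(c, general_lm)

    # Original (uncondensed) sentences are returned, most task-like first.
    return sorted(pool, key=score)


task = [["the", "patient", "took", "the", "dose"], ["dose", "for", "the", "patient"]]
general = [["the", "match", "ended", "in", "a", "draw"], ["the", "team", "won", "the", "match"]]
pool = [["the", "team", "won"], ["the", "patient", "dose"]]

ranked = rank_by_ce_difference(pool, task, general)
```

Cynical selection differs from this per-sentence scoring in that it builds the selected set incrementally, so each candidate is judged relative to what has already been chosen; that procedure is not shown in this sketch.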

