Data Selection with Cluster-Based Language Difference Models and Cynical Selection

by Lucía Santamaría, et al.

We present and apply two methods for selecting relevant training data from a general pool for use in tasks such as machine translation. Building on existing work on class-based language difference models, we first introduce a cluster-based method that uses Brown clusters to condense the vocabulary of the corpora. Second, we implement the cynical data selection method, which incrementally constructs a training corpus to efficiently model the task corpus. Both the cluster-based and the cynical data selection approaches are used for the first time within a machine translation system, and we perform a head-to-head comparison. Our intrinsic evaluations show that both new methods outperform the standard Moore-Lewis approach (cross-entropy difference), yielding better perplexity and OOV rates on in-domain data. The cynical approach converges much more quickly, covering nearly all of the in-domain vocabulary with 84 [...]. Furthermore, the new approaches can be used to select machine translation training data for training better systems. Our results confirm that class-based selection using Brown clusters is a viable alternative to POS-based class-based methods and removes the reliance on a part-of-speech tagger. Additionally, we validate the recently proposed cynical data selection method, showing that its performance in SMT models surpasses that of traditional cross-entropy difference methods and that it more closely matches the sentence length of the task corpus.
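As a rough illustration of the cross-entropy difference baseline named in the abstract, the sketch below scores each pool sentence by its per-token cross-entropy under a task (in-domain) model minus its cross-entropy under a general model, and ranks the lowest scores first. It is a toy under stated assumptions, not the authors' implementation: add-one smoothed unigram models stand in for the n-gram language models normally used, and the function names and the optional `word_to_class` dictionary (a stand-in for a word-to-cluster mapping such as Brown clusters) are hypothetical.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary (toy stand-in for an n-gram LM)."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def cross_entropy(sentence, model):
    """Per-token cross-entropy, in bits, of a non-empty sentence under a unigram model."""
    return -sum(math.log2(model[tok]) for tok in sentence) / len(sentence)


def moore_lewis_rank(pool, task, general_sample, word_to_class=None):
    """Rank pool sentences by H_task(s) - H_general(s), lowest (most task-like) first.

    If word_to_class is given (hypothetical word -> class-label dict), every token is
    replaced by its class before modelling, which condenses the vocabulary.
    """
    if word_to_class is None:
        relabel = list
    else:
        # Assumption: words missing from the mapping share one "<unk>" class.
        def relabel(sent):
            return [word_to_class.get(tok, "<unk>") for tok in sent]

    task_r = [relabel(s) for s in task]
    general_r = [relabel(s) for s in general_sample]
    pool_r = [relabel(s) for s in pool]

    # Shared vocabulary, so that both models assign mass to every pool token.
    vocab = {tok for corpus in (task_r, general_r, pool_r) for s in corpus for tok in s}
    task_lm = train_unigram(task_r, vocab)
    general_lm = train_unigram(general_r, vocab)

    scored = [
        (cross_entropy(r, task_lm) - cross_entropy(r, general_lm), original)
        for original, r in zip(pool, pool_r)
        if r  # skip empty sentences
    ]
    return sorted(scored, key=lambda pair: pair[0])
```

Calling `moore_lewis_rank(pool, task, general_sample)` returns `(score, sentence)` pairs; keeping the top of that list up to some threshold gives the selected subset. Passing a `word_to_class` mapping runs the same ranking over class labels instead of words, which is the general idea behind class-based difference models.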




